Abstract

Approximate Newton methods are a standard optimization tool which aim to maintain
the benefits of Newton’s method, such as a fast rate of convergence, whilst alleviating its
drawbacks, such as computationally expensive calculation or estimation of the inverse
Hessian. In this work we investigate approximate Newton methods for policy optimization in
Markov decision processes (MDPs). We first analyse the structure of the Hessian of the
objective function for MDPs. We show that, like the gradient, the Hessian exhibits useful
structure in the context of MDPs and we use this analysis to motivate two Gauss-Newton
methods for MDPs. Like the Gauss-Newton method for non-linear least squares, these
methods involve approximating the Hessian by ignoring certain terms in the Hessian which
are difficult to estimate. The approximate Hessians possess desirable properties, such as
negative definiteness, and we demonstrate several important performance guarantees
including guaranteed ascent directions, invariance to affine transformation of the parameter
space, and convergence guarantees. We finally provide a unifying perspective of key policy
search algorithms, demonstrating that our second Gauss-Newton algorithm is closely
related to both the EM-algorithm and natural gradient ascent applied to MDPs, but performs
significantly better in practice on a range of challenging domains.

1. Introduction


Markov decision processes (MDPs) are the standard model for optimal control in a fully
observable environment (Bertsekas, 2010). Strong empirical results have been obtained
in numerous challenging real-world optimal control problems using the MDP framework.
This includes problems of non-linear control (Stengel; Li and Todorov, 2004; Todorov
and Tassa, 2009; Deisenroth and Rasmussen, 2011; Rawlik et al., 2012; Spall and Cristion,
1998), robotic applications (Kober and Peters, 2011; Kohl and Stone, 2004; Vlassis et al.,
2009), biological movement systems (Li, 2006), traffic management (Richter et al., 2007;
Srinivasan et al., 2006), helicopter flight control (Abbeel et al., 2007), elevator scheduling
(Crites and Barto, 1995) and numerous games, including chess (Veness et al., 2009), go
(Gelly and Silver, 2008), backgammon (Tesauro, 1994) and Atari video games (Mnih et al.,
2015).


It is well-known that the global optimum of a MDP can be obtained through methods
based on dynamic programming, such as value iteration (Bellman, 1957) and policy
iteration (Howard, 1960). However, these techniques are known to suffer from the curse of
dimensionality, which makes them infeasible for most real-world problems of interest. As a
result, most research in the reinforcement learning and control theory literature has focused
on obtaining approximate or locally optimal solutions. There exists a broad spectrum of
such techniques, including approximate dynamic programming methods (Bertsekas, 2010),
tree search methods (Russell and Norvig, 2009; Kocsis and Szepesvari, 2006; Browne et al.,
2012), local trajectory-optimization techniques, such as differential dynamic programming
(Jacobson and Mayne, 1970) and iLQG (Li and Todorov, 2006), and policy search methods
(Williams, 1992; Baxter and Bartlett, 2001; Sutton et al., 2000; Marbach and Tsitsiklis,
2001; Kakade, 2002; Kober and Peters, 2011).


The focus of this paper is on policy search methods, which are a family of algorithms
that have proven extremely popular in recent years, and which have numerous desirable
properties that make them attractive in practice. Policy search algorithms are typically
specialized applications of techniques from numerical optimization (Nocedal and Wright,
2006; Dempster et al., 1977). As such, the controller is defined in terms of a differentiable
representation and local information about the objective function, such as the gradient, is
used to update the controller in a smooth, non-greedy manner. Such updates are performed
in an incremental manner until the algorithm converges to a local optimum of the objective
function. There are several benefits to such an approach: the smooth updates of the
control parameters endow these algorithms with very general convergence guarantees; as
performance is improved at each iteration (or at least on average in stochastic policy search
methods) these algorithms have good anytime performance properties; it is not necessary
to approximate the value function, which is typically a difficult function to approximate
(instead it is only necessary to approximate a low-dimensional projection of the value
function, an observation which has led to the emergence of so-called actor-critic methods
(Konda and Tsitsiklis, 2003, 1999; Bhatnagar et al., 2008, 2009)); policy search methods are
easily extendable to models for optimal control in a partially observable environment, such
as the finite state controllers (Meuleau et al., 1999; Toussaint et al., 2006).


In (stochastic) steepest gradient ascent (Williams, 1992; Baxter and Bartlett, 2001;
Sutton et al., 2000) the control parameters are updated by moving in the direction of
the gradient of the objective function. While steepest gradient ascent has enjoyed some
success, it suffers from a serious issue that can hinder its performance. Specifically, the
steepest ascent direction is not invariant to rescaling the components of the parameter
space and the gradient is often poorly-scaled, i.e., the variation of the objective function
differs dramatically along the different components of the gradient, and this leads to a
poor rate of convergence. It also makes the construction of a good step size sequence a
difficult problem, which is an important issue in stochastic methods.¹ Poor scaling is a
well-known problem with steepest gradient ascent and alternative numerical optimization
techniques have been considered in the policy search literature. Two approaches that have
proven to be particularly popular are Expectation Maximization (Dempster et al., 1977)
and natural gradient ascent (Amari, 1997, 1998; Amari et al., 1992), which have both been
successfully applied to various challenging MDPs (see Dayan and Hinton (1997); Kober and
Peters (2009); Toussaint et al. (2011) and Kakade (2002); Bagnell and Schneider (2003)
respectively).

An avenue of research that has received less attention is the application of Newton’s
method to Markov decision processes. Although Baxter and Bartlett (2001) provide such
an extension of their GPOMDP algorithm, they give no empirical results in either Baxter
and Bartlett (2001) or the accompanying paper of empirical comparisons (Baxter et al.,
2001). There has since been only a limited amount of research into using the second order
information contained in the Hessian during the parameter update. To the best of our
knowledge only two attempts have been made: in Schraudolph et al. (2006) an on-line
estimate of a Hessian-vector product is used to adapt the step size sequence in an on-line
manner; in Ngo et al. (2011), Bayesian policy gradient methods (Ghavamzadeh and
Engel, 2007) are extended to the Newton method. There are several reasons for this lack
of interest. Firstly, in many problems the construction and inversion of the Hessian is too
computationally expensive to be feasible. Additionally, the objective function of a MDP
is typically not concave, and so the Hessian isn’t guaranteed to be negative-definite. As
a result, the search direction of the Newton method may not be an ascent direction, and
hence a parameter update could actually lower the objective. Additionally, the variance of
sample-based estimators of the Hessian will be larger than that of estimators of the gradient.
This is an important point because the variance of gradient estimates can be a problematic
issue and various methods, such as baselines (Weaver and Tao, 2001; Greensmith et al.,
2004), exist to reduce the variance.


Many of these problems are not particular to Markov decision processes, but are general
longstanding issues that plague the Newton method. Various methods have been developed
in the optimization literature to alleviate these issues, whilst also maintaining desirable
properties of the Newton method. For instance, quasi-Newton methods were designed
to efficiently mimic the Newton method using only evaluations of the gradient obtained
during previous iterations of the algorithm. These methods have low computational costs,
a super-linear rate of convergence and have proven to be extremely effective in practice. See
Nocedal and Wright (2006) for an introduction to quasi-Newton methods. Alternatively,
the well-known Gauss-Newton method is a popular approach that aims to efficiently mimic
the Newton method. The Gauss-Newton method is particular to non-linear least squares
objective functions, for which the Hessian has a particular structure. Due to this structure
there exist certain terms in the Hessian that can be used as a useful proxy for the Hessian
itself, with the resulting algorithm having various desirable properties. For instance, the
preconditioning matrix used in the Gauss-Newton method is guaranteed to be positive-definite,
so that the non-linear least squares objective is guaranteed to decrease for a sufficiently small
step size.

1. This is because line search techniques lose much of their desirability in stochastic numerical optimization
algorithms, due to variance in the evaluations.

While a straightforward application of quasi-Newton methods will not typically be
possible for MDPs,² in this paper we consider whether an analogue to the Gauss-Newton
method exists, so that the benefits of such methods can be applied to MDPs. The specific
contributions are as follows:

• In Section 3 we present an analysis of the Hessian for MDPs. Our starting point is
a policy Hessian theorem (Theorem 3) and we analyse the behaviour of individual
terms of the Hessian to provide insight into constructing efficient approximate Newton
methods for policy optimization. In particular we show that certain terms are
negligible near local optima.

• Motivated by this analysis, in Section 4 we provide two Gauss-Newton type methods
for policy optimization in MDPs which retain certain terms of our Hessian decomposition
in the preconditioner in a gradient-based policy search algorithm. The first
method discards terms which are negligible near local optima and are difficult to
approximate. The second method further discards an additional term which we cannot
guarantee to be negative-definite. We provide an analysis of our Gauss-Newton methods
and give several important performance guarantees for the second Gauss-Newton
method:

  - We demonstrate that the preconditioning matrix is negative-definite when the
    controller is log-concave in the control parameters (detailing some widely used
    controllers for which this condition holds), guaranteeing that the search direction
    is an ascent direction.

  - We show that the method is invariant to affine transformations of the parameter
    space and thus does not suffer the significant drawback of steepest ascent.

  - We provide a convergence analysis, demonstrating linear convergence to local
    optima, in terms of the step size of the update. One key practical benefit of this
    analysis is that the step size for the incremental update can be chosen independently
    of unknown quantities, while retaining a guarantee of convergence.

  - The preconditioner has a particular form which enables the ascent direction to be
    computed particularly efficiently via a Hessian-free conjugate gradient method
    in large parameter spaces.

• In Section 5 we present a unifying perspective for several policy search methods.
In particular we relate the search direction of our second Gauss-Newton algorithm
to that of Expectation Maximization (which provides new insights into the latter
algorithm when used for policy search), and we also discuss its relationship to the
natural gradient algorithm.

• Finally, we present experiments demonstrating state-of-the-art performance on
challenging domains including Tetris and robotic arm applications.

2. In quasi-Newton methods, to ensure an increase in the objective function it is necessary to satisfy
the secant condition (Nocedal and Wright, 2006). This condition is satisfied when the objective is
concave/convex or the strong Wolfe conditions are met during a line search. For this reason, stochastic
applications of quasi-Newton methods have been restricted to convex/concave objective functions
(Schraudolph et al., 2007).


2. Preliminaries and Background 

In Section 2.1 we introduce Markov decision processes, along with some standard terminology
relating to these models that will be required throughout the paper. In Section 2.2 we
introduce policy search methods and detail several key algorithms from the literature.


2.1 Markov Decision Processes 

In a Markov decision process an agent, or controller, interacts with an environment over 
the course of a planning horizon. At each point in the planning horizon the agent selects an 
action (based on the current state of the environment) and receives a scalar reward. The
amount of reward received depends on the selected action and the state of the environment. 
Once an action has been performed the system transitions to the next point in the planning 
horizon, and the new state of the environment is determined (often in a stochastic manner) 
by the action the agent selected and the current state of the environment. The optimality 
of an agent’s behaviour is measured in terms of the total reward the agent can expect to 
receive over the course of the planning horizon, so that optimal control is obtained when 
this quantity is maximized. 

Formally a MDP is described by the tuple $\{\mathcal{S}, \mathcal{A}, D, P, R\}$, in which $\mathcal{S}$ and $\mathcal{A}$ are sets,
known respectively as the state and action space, $D$ is the initial state distribution, which
is a distribution over the state space, $P$ is the transition dynamics and is formed of the
set of conditional distributions over the state space, $\{P(\cdot|s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$, and
$R : \mathcal{S} \times \mathcal{A} \to [0, R_{\max}]$ is the (deterministic) reward function, which is assumed to be
bounded and non-negative. Given a planning horizon, $H \in \mathbb{N}$, and a time-point in the planning
horizon, $t \in \mathbb{N}_H$, we use the notation $s_t$ and $a_t$ to denote the random variable of the state and
action of the time-point, respectively. The state at the initial time-point is determined by the
initial state distribution, $s_1 \sim D(\cdot)$. At any given time-point, $t \in \mathbb{N}_H$, and given the state
of the environment, the agent selects an action, $a_t \sim \pi(\cdot|s_t)$, according to the policy $\pi$. The
state of the next point in the planning horizon is determined according to the transition
dynamics, $s_{t+1} \sim P(\cdot|a_t, s_t)$. This process of selecting actions and transitioning to a new
state is iterated sequentially through all of the time-points in the planning horizon. At each
point in the planning horizon the agent receives a scalar reward, which is determined by
the reward function.

The objective of a MDP is to find the policy that maximizes a given function of the
expected reward over the course of the planning horizon. In this paper we usually consider
the infinite horizon discounted reward framework, so that the objective function takes the
form

\[ U(\pi) = \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t}\Big[ \gamma^{t-1} R(s_t, a_t); \pi, D \Big], \tag{1} \]

where we use the semi-colon to identify parameters of the distribution, rather than
conditioning variables, and where the distribution of $s_t$ and $a_t$, which we denote by $p_t$, is given by
the marginal at time $t$ of the joint distribution over $(s_{1:t}, a_{1:t})$, where $s_{1:t} = (s_1, s_2, \ldots, s_t)$,
$a_{1:t} = (a_1, a_2, \ldots, a_t)$, denoted by

\[ p(s_{1:t}, a_{1:t}; \pi) := \pi(a_t|s_t) \Big\{ \prod_{\tau=1}^{t-1} P(s_{\tau+1}|s_\tau, a_\tau)\, \pi(a_\tau|s_\tau) \Big\} D(s_1). \]

The discount factor $\gamma \in [0,1)$ in (1) ensures that the objective is bounded.

We use the notation $\xi_t = (s_1, a_1, s_2, a_2, \ldots, s_t, a_t)$ to denote trajectories through the
state-action space of length $t \in \mathbb{N}$. We use $\xi$ to denote trajectories that are of infinite
length, and use $\Xi$ to denote the space of all such trajectories. Given a trajectory, $\xi \in \Xi$, we
use the notation $R(\xi)$ to denote the total discounted reward of the trajectory, so that

\[ R(\xi) = \sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a_t). \]

Similarly, we use the notation $p(\xi; \pi)$ to denote the probability of generating the trajectory
$\xi$ under the policy $\pi$.

We now introduce several functions that are of central importance. The value function
w.r.t. policy $\pi$ is defined as the total expected future reward given the current state,

\[ V_\pi(s) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t}\Big[ \gamma^{t-1} R(s_t, a_t) \,\Big|\, s_1 = s; \pi \Big]. \tag{2} \]

It can be seen that $U(\pi) = \mathbb{E}_{s \sim D}[V_\pi(s)]$. The value function can also be written as the
solution of the following fixed-point equation,

\[ V_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\Big[ R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[ V_\pi(s') \big] \Big], \tag{3} \]

which is known as the Bellman equation (Bertsekas, 2010). The state-action value function
w.r.t. policy $\pi$ is given by

\[ Q_\pi(s, a) := R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[ V_\pi(s') \big], \tag{4} \]

and gives the value of performing an action, in a given state, and then following the policy.
Note that $V_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q_\pi(s, a)$. Finally, the advantage function (Baird, 1993),

\[ A_\pi(s, a) := Q_\pi(s, a) - V_\pi(s), \]

gives the relative advantage of an action in relation to the other actions available in that
state and it can be seen that $\sum_{a \in \mathcal{A}} \pi(a|s) A_\pi(s, a) = 0$, for each $s \in \mathcal{S}$.
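The definitions (2) to (4) and the zero-mean property of the advantage function can be checked numerically on a small example. The following Python sketch assumes a hypothetical two-state, two-action MDP; the arrays `P`, `R`, `D`, the policy `PI` and the helper names are illustrative choices, not taken from the paper. It evaluates the policy by repeatedly applying the Bellman equation (3), which is a contraction for a discount factor below one.

```python
GAMMA = 0.9                       # discount factor (illustrative)
P = [[[0.9, 0.1], [0.2, 0.8]],    # P[s][a][s']: transition dynamics (hypothetical)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0]]      # R[s][a]: bounded, non-negative rewards (hypothetical)
D = [0.5, 0.5]                    # initial state distribution
PI = [[0.6, 0.4], [0.3, 0.7]]     # PI[s][a]: a fixed stochastic policy
STATES, ACTIONS = range(2), range(2)


def q_from_v(v):
    """State-action values computed from state values, as in equation (4)."""
    return [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)
             for a in ACTIONS] for s in STATES]


def evaluate_policy(pi, sweeps=2000):
    """Approximate V_pi by repeatedly applying the Bellman equation (3)."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        q = q_from_v(v)
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
    return v


V = evaluate_policy(PI)
Q = q_from_v(V)
ADV = [[Q[s][a] - V[s] for a in ACTIONS] for s in STATES]
U = sum(D[s] * V[s] for s in STATES)  # U(pi) = E_{s ~ D}[V_pi(s)]

for s in STATES:
    # The policy-weighted advantages vanish (up to rounding) in every state.
    print(s, V[s], sum(PI[s][a] * ADV[s][a] for a in ACTIONS))
print("U(pi) =", U)
```

Fixed-point iteration is used here only because it mirrors equation (3) directly; for larger state spaces a linear solve would be the more usual choice.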
2.2 Policy Search Methods

In policy search methods the policy is given some differentiable parametric form, denoted
$\pi(a|s; w)$ with $w$ the policy parameter, and local information, such as the gradient of the
objective function, is used to update the policy in a smooth non-greedy manner. This
process is iterated in an incremental manner until the algorithm converges to a local optimum
of the objective function. Denoting the parameter space by $\mathcal{W} \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$, we write the
objective function directly in terms of the parameter vector, i.e.,

\[ U(w) = \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \sum_{t=1}^{\infty} \gamma^{t-1} p_t(s, a; w) R(s, a), \quad w \in \mathcal{W}, \tag{5} \]

while the trajectory distribution is written in the form

\[ p(a_{1:H}, s_{1:H}; w) = \pi(a_H|s_H; w) \Big\{ \prod_{t=1}^{H-1} P(s_{t+1}|a_t, s_t)\, \pi(a_t|s_t; w) \Big\} D(s_1), \quad H \in \mathbb{N}. \tag{6} \]

Similarly, $V(s; w)$, $Q(s, a; w)$ and $A(s, a; w)$ denote respectively the value function,
state-action value function and the advantage function in terms of the parameter vector $w$. We
introduce the notation

\[ p_\gamma(s, a; w) := \sum_{t=1}^{\infty} \gamma^{t-1} p_t(s, a; w). \tag{7} \]

Note that the objective function can be written

\[ U(w) = \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) R(s, a). \tag{8} \]
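Equations (7) and (8) can be illustrated with a short Python sketch. It assumes a hypothetical two-state, two-action MDP and a tabular softmax policy (both illustrative assumptions, as are all function names), and it truncates the infinite sum in (7) at a horizon where the discount weight is negligible. Since each $p_t$ is a probability distribution, the total mass of $p_\gamma$ should be close to $1/(1-\gamma)$.

```python
import math

GAMMA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]],    # P[s][a][s']: transition dynamics (hypothetical)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0]]      # R[s][a]: rewards (hypothetical)
D = [0.5, 0.5]                    # initial state distribution
STATES, ACTIONS = range(2), range(2)
HORIZON = 2000                    # truncation point; GAMMA**HORIZON is negligible


def policy(w):
    """Tabular softmax policy: pi[s][a] is proportional to exp(w[2*s + a])."""
    pi = []
    for s in STATES:
        z = [math.exp(w[2 * s + a]) for a in ACTIONS]
        pi.append([zi / sum(z) for zi in z])
    return pi


def occupancy(pi):
    """Truncated version of equation (7): sum_t GAMMA**(t-1) * p_t(s, a)."""
    d = list(D)                   # state marginal at the current time-point
    occ = [[0.0 for _ in ACTIONS] for _ in STATES]
    discount = 1.0
    for _ in range(HORIZON):
        for s in STATES:
            for a in ACTIONS:
                occ[s][a] += discount * d[s] * pi[s][a]
        # Propagate the state marginal one step through policy and dynamics.
        d = [sum(d[s] * pi[s][a] * P[s][a][s2] for s in STATES for a in ACTIONS)
             for s2 in STATES]
        discount *= GAMMA
    return occ


def objective(w):
    """Equation (8): U(w) = sum_{s,a} p_gamma(s, a; w) R(s, a)."""
    occ = occupancy(policy(w))
    return sum(occ[s][a] * R[s][a] for s in STATES for a in ACTIONS)


w = [0.2, -0.1, 0.0, 0.5]
occ = occupancy(policy(w))
total_mass = sum(occ[s][a] for s in STATES for a in ACTIONS)
print("total mass:", total_mass, "U(w):", objective(w))
```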


We shall consider two forms of policy search algorithm in this paper, gradient-based
optimization methods and methods based on iteratively optimizing a lower-bound on the
objective function. In gradient-based methods the update of the policy parameters takes the
form

\[ w^{\text{new}} = w + \alpha \mathcal{M}(w) \nabla_w U(w), \tag{9} \]

where $\alpha \in \mathbb{R}^+$ is the step size parameter and $\mathcal{M}(w)$ is some preconditioning matrix that
possibly depends on $w \in \mathcal{W}$. If $\mathcal{M}(w)$ is positive-definite and $\alpha$ is sufficiently small, then
such an update will increase the total expected reward. Provided that the preconditioning
matrix is always positive-definite and the step size sequence is appropriately selected,
by iteratively updating the policy parameters according to (9) the policy parameters will
converge to a local optimum of (5). This generic gradient-based policy search algorithm
is given in Algorithm 1. Gradient-based methods vary in the form of the preconditioning
matrix used in the parameter update. The choice of the preconditioning matrix determines
various aspects of the resulting algorithm, such as the computational complexity, the rate
at which the algorithm converges to a local optimum and invariance properties of the
parameter update. Typically the gradient $\nabla_w U(w)$ and the preconditioner $\mathcal{M}(w)$ will not
be known exactly and must be approximated by collecting data from the system. In the
context of reinforcement learning, the Expectation Maximization (EM) algorithm searches
for the optimal policy by iteratively optimizing a lower bound on the objective function.
While the EM-algorithm doesn’t have an update of the form given in (9), we shall see in
Section 5 that the algorithm is closely related to such an update. We now review specific
policy search methods.

Algorithm 1: Generic gradient-based policy search algorithm
Input: Initial vector of policy parameters, $w_0 \in \mathcal{W}$, and a step size sequence,
       $\{\alpha_k\}_{k=0}^{\infty}$ with $\alpha_k \in \mathbb{R}^+$ for $k \in \mathbb{N}$.
Set iteration counter, $k \leftarrow 0$.
repeat
    Calculate the gradient of the objective, $\nabla_{w|w=w_k} U(w)$, and the preconditioner
    $\mathcal{M}(w_k)$ at the current point in the parameter space.
    Update policy parameters, $w_{k+1} = w_k + \alpha_k \mathcal{M}(w_k) \nabla_{w|w=w_k} U(w)$.
    Update iteration counter, $k \leftarrow k + 1$.
until Convergence of the policy parameters;
return $w_k$
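Algorithm 1 can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the gradient and preconditioner are passed in as exact callables (in practice they would be estimated from data), convergence is declared when the largest parameter change falls below a tolerance, and the quadratic toy objective and all function names are hypothetical.

```python
def policy_search(w0, gradient, preconditioner, step_size, tol=1e-10, max_iters=10000):
    """Iterate w <- w + alpha_k * M(w) * grad U(w), as in the generic update (9)."""
    w = list(w0)
    n = len(w)
    for k in range(max_iters):
        g = gradient(w)
        m = preconditioner(w)
        # Search direction: preconditioning matrix applied to the gradient.
        direction = [sum(m[i][j] * g[j] for j in range(n)) for i in range(n)]
        alpha = step_size(k)
        new_w = [w[i] + alpha * direction[i] for i in range(n)]
        converged = max(abs(new_w[i] - w[i]) for i in range(n)) < tol
        w = new_w
        if converged:
            return w, k + 1
    return w, max_iters


# Hypothetical concave toy objective U(w) = -(w0 - 1)^2 - (w1 + 2)^2.
def toy_gradient(w):
    return [-2.0 * (w[0] - 1.0), -2.0 * (w[1] + 2.0)]


def identity(w):
    # Identity preconditioner, i.e. steepest gradient ascent.
    return [[1.0, 0.0], [0.0, 1.0]]


w_star, iterations = policy_search([0.0, 0.0], toy_gradient, identity, lambda k: 0.25)
print(w_star, iterations)
```

Passing the preconditioner as a callable keeps the loop unchanged across the methods reviewed in this section: only `preconditioner` differs between steepest ascent, natural gradient ascent and Newton-type updates.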


2.2.1 Steepest Gradient Ascent 

Steepest gradient ascent corresponds to the choice $\mathcal{M}(w) = I_n$, where $I_n$ denotes the $n \times n$
identity matrix, so that the parameter update takes the form:

Policy search update using steepest ascent

\[ w^{\text{new}} = w + \alpha \nabla_w U(w). \tag{10} \]

The gradient $\nabla_w U(w)$ can be written in a relatively simple form using the following theorem
(Sutton et al., 2000):

Theorem 1 (Policy Gradient Theorem (Sutton et al., 2000)). Suppose we are given a
Markov Decision Process with objective (5) and Markovian trajectory distribution (6). For
any given parameter vector, $w \in \mathcal{W}$, the gradient of (5) takes the form

\[ \nabla_w U(w) = \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) Q(s, a; w) \nabla_w \log \pi(a|s; w). \tag{11} \]

Proof. This is a well-known result that can be found in Sutton et al. (2000). A derivation
of (11) is provided in Section A.1 in the Appendix. □
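As a sanity check of (11), the following Python sketch compares the policy gradient formula with central finite differences of the objective. It assumes a hypothetical two-state, two-action MDP and a tabular softmax policy, for which the score with respect to the parameter of the pair $(s, b)$ is $\mathbb{1}[a = b] - \pi(b|s; w)$; the MDP values, the truncation horizon `T` and all function names are illustrative assumptions.

```python
import math

GAMMA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]  # P[s][a][s'], hypothetical
R = [[0.0, 1.0], [2.0, 0.0]]                               # R[s][a], hypothetical
D = [0.5, 0.5]                                             # initial state distribution
S, A, T = range(2), range(2), 1000                         # T truncates the infinite sums


def policy(w):
    """Tabular softmax policy with pi[s][a] proportional to exp(w[2*s + a])."""
    pi = []
    for s in S:
        z = [math.exp(w[2 * s + a]) for a in A]
        pi.append([zi / sum(z) for zi in z])
    return pi


def occupancy(pi):
    """Discounted occupancy p_gamma(s, a) of equation (7), truncated after T steps."""
    d, discount = list(D), 1.0
    occ = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(T):
        for s in S:
            for a in A:
                occ[s][a] += discount * d[s] * pi[s][a]
        d = [sum(d[s] * pi[s][a] * P[s][a][s2] for s in S for a in A) for s2 in S]
        discount *= GAMMA
    return occ


def q_values(pi):
    """State-action values obtained from T sweeps of the Bellman equation (3)."""
    v = [0.0, 0.0]
    for _ in range(T):
        v = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in S))
                 for a in A) for s in S]
    return [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in S) for a in A] for s in S]


def objective(w):
    """Equation (8)."""
    occ = occupancy(policy(w))
    return sum(occ[s][a] * R[s][a] for s in S for a in A)


def policy_gradient(w):
    """Equation (11); entry 2*s + b is the derivative w.r.t. the parameter of (s, b)."""
    pi = policy(w)
    occ, q = occupancy(pi), q_values(pi)
    return [sum(occ[s][a] * q[s][a] * ((1.0 if a == b else 0.0) - pi[s][b]) for a in A)
            for s in S for b in A]


def finite_difference_gradient(w, eps=1e-4):
    """Central finite differences of the objective, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up) - objective(down)) / (2.0 * eps))
    return grad


w = [0.2, -0.1, 0.0, 0.5]
exact, numeric = policy_gradient(w), finite_difference_gradient(w)
max_gap = max(abs(x - y) for x, y in zip(exact, numeric))
print(exact, numeric, max_gap)
```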

It is not possible to calculate the gradient exactly for most real-world MDPs of interest.
For instance, in discrete domains the size of the state-action space may be too large for
enumeration over these sets to be feasible. Alternatively, in continuous domains the
presence of non-linearities in the transition dynamics makes the calculation of the occupancy
marginals an intractable problem. Various techniques have been proposed in the literature
to estimate the gradient, including finite-difference methods ([...], 1952; Kohl and Stone,
2004; Tedrake and Zhang, 2005), simultaneous perturbation methods (Spall, 1992; Spall and
Cristion, 1996; Srinivasan et al., 2006) and likelihood-ratio methods (Glynn, 1986, 1990;
Williams, 1992; Baxter and Bartlett, 2001; Konda and Tsitsiklis, 2003, 1999; Sutton et al.,
2000; Bhatnagar et al., 2009; Kober and Peters, 2011). Likelihood-ratio
methods, which originated in the statistics literature and were later applied to MDPs, are
now the prominent method for estimating the gradient. There are numerous such methods in
the literature, including Monte-Carlo methods (Williams, 1992; Baxter and Bartlett, 2001)
and actor-critic methods (Konda and Tsitsiklis, 2003, 1999; Sutton et al., 2000; Bhatnagar
et al., 2009; Kober and Peters, 2011).
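To make the likelihood-ratio idea concrete, here is a minimal Monte-Carlo sketch in Python. It is not any specific estimator from the cited works: it simply averages, over sampled trajectories, the score of each selected action weighted by the discounted reward collected from that time-point onwards, and it truncates each trajectory at a finite horizon. The two-state MDP, the tabular softmax policy, the sample sizes and the function names are all illustrative assumptions.

```python
import math
import random

GAMMA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]  # P[s][a][s'], hypothetical
R = [[0.0, 1.0], [2.0, 0.0]]                               # R[s][a], hypothetical
D = [0.5, 0.5]                                             # initial state distribution
STATES, ACTIONS = [0, 1], [0, 1]


def policy(w):
    """Tabular softmax policy with pi[s][a] proportional to exp(w[2*s + a])."""
    pi = []
    for s in STATES:
        z = [math.exp(w[2 * s + a]) for a in ACTIONS]
        pi.append([zi / sum(z) for zi in z])
    return pi


def likelihood_ratio_gradient(w, rng, episodes=500, horizon=60):
    """Monte-Carlo likelihood-ratio estimate of the (horizon-truncated) gradient."""
    pi = policy(w)
    grad = [0.0] * 4
    for _ in range(episodes):
        s = rng.choices(STATES, weights=D)[0]
        steps = []                               # (state, action, discounted reward)
        for t in range(horizon):
            a = rng.choices(ACTIONS, weights=pi[s])[0]
            steps.append((s, a, GAMMA ** t * R[s][a]))
            s = rng.choices(STATES, weights=P[s][a])[0]
        reward_to_go = 0.0
        for s_t, a_t, r_t in reversed(steps):
            reward_to_go += r_t                  # discounted reward from this step onwards
            for b in ACTIONS:                    # score of the tabular softmax policy
                score = (1.0 if a_t == b else 0.0) - pi[s_t][b]
                grad[2 * s_t + b] += reward_to_go * score / episodes
    return grad


estimate = likelihood_ratio_gradient([0.2, -0.1, 0.0, 0.5], random.Random(0))
print(estimate)
```

The estimate is noisy, and its variance is the reason baselines and actor-critic variants are used in practice; the sketch omits both for brevity.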


Steepest gradient ascent is known to perform poorly on objective functions that are
poorly-scaled, that is, if changes to some parameters produce much larger variations to the
function than changes in other parameters. In this case steepest gradient ascent zig-zags
along the ridges of the objective in the parameter space (see e.g., Nocedal and Wright,
2006). It can be extremely difficult to gauge an appropriate scale for the step sizes in
poorly-scaled problems and the robustness of optimization algorithms to poor scaling is of
significant practical importance in reinforcement learning since line search procedures to
find a suitable step size are often impractical.
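The effect of poor scaling can be seen on a two-parameter toy problem. The Python sketch below uses the hypothetical objective $U(w) = -\tfrac{1}{2}(w_1^2 + 100 w_2^2)$, which is not from the paper, and counts the updates needed to reach the optimum with and without a diagonal preconditioner that rescales the second coordinate. With the identity preconditioner the step size must stay below $2/100$ for stability, so progress along the first coordinate is slow while the second coordinate changes sign at every step.

```python
def ascent_steps(preconditioner, alpha, tol=1e-6, max_iters=100000):
    """Count updates w <- w + alpha * M * grad U(w) until both coordinates are near 0.

    `preconditioner` holds the diagonal entries of M. The second return value
    records the sign of the second coordinate after each update.
    """
    w = [1.0, 1.0]
    signs = []
    for k in range(max_iters):
        if max(abs(w[0]), abs(w[1])) < tol:
            return k, signs
        grad = [-w[0], -100.0 * w[1]]   # gradient of U(w) = -(w0^2 + 100 w1^2) / 2
        w = [w[i] + alpha * preconditioner[i] * grad[i] for i in range(2)]
        signs.append(w[1] > 0.0)
    return max_iters, signs


# Steepest ascent: identity preconditioner, step size just inside the stable range.
steepest_iters, steepest_signs = ascent_steps([1.0, 1.0], alpha=0.019)
# Rescaled ascent: the preconditioner undoes the factor of 100 in the second coordinate.
scaled_iters, _ = ascent_steps([1.0, 0.01], alpha=1.0)
print(steepest_iters, scaled_iters, steepest_signs[:4])
```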


2.2.2 Natural Gradient Ascent 

Natural gradient ascent techniques originated in the neural network and blind source
separation literature (Amari, 1997, 1998; Amari et al., 1996, 1992), and were introduced into
the policy search literature in Kakade (2002). To address the issue of poor scaling, natural
gradient methods take the perspective that the parameter space should be viewed with a
manifold structure in which distance between points on the manifold captures discrepancy
between the models induced by different parameter vectors. In natural gradient ascent
$\mathcal{M}(w) = G^{-1}(w)$ in (9), with $G(w)$ denoting the Fisher information matrix, so that the
parameter update takes the form

Policy search update using natural gradient ascent

\[ w^{\text{new}} = w + \alpha G^{-1}(w) \nabla_w U(w). \tag{12} \]

In the case of Markov decision processes the Fisher information matrix takes the form

\[ G(w) = -\sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \nabla_w \nabla_w^\top \log \pi(a|s; w), \tag{13} \]

which can then be viewed as imposing a local norm on the parameter space which is a
second order approximation to the KL-divergence between induced policy distributions.


When the trajectory distribution satisfies the Fisher regularity conditions (Lehmann and
Casella, 1998) there is an alternate, equivalent, form of the Fisher information matrix given
by

\[ G(w) = \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \nabla_w \log \pi(a|s; w) \nabla_w^\top \log \pi(a|s; w). \tag{14} \]


There are several desirable properties of the natural gradient approach: the Fisher
information matrix is always positive-definite, regardless of the policy parametrization; the
search direction is invariant to the parametrization of the policy (Bagnell and Schneider,
2003; Peters and Schaal, 2008). Additionally, when using a compatible function approximator
(Sutton et al., 2000) within an actor-critic framework, the optimal critic parameters
coincide with the natural gradient. Furthermore, natural gradient ascent has been shown
to perform well in some difficult MDP environments, including Tetris (Kakade, 2002) and
several challenging robotics problems (Peters and Schaal, 2008). However, theoretically, the
rate of convergence of natural gradient ascent is the same as steepest gradient ascent, i.e.,
linear, although it has been noted to be substantially faster in practice.
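The equivalence of the two forms (13) and (14) of the Fisher information matrix can be checked numerically. The Python sketch below does so for a tabular softmax policy on a hypothetical two-state, two-action MDP (illustrative assumptions, as are the function names). For this parametrization the Hessian of $\log \pi(a|s; w)$ with respect to the parameters of state $s$ has entries $-(\mathbb{1}[b = c]\,\pi(b|s) - \pi(b|s)\pi(c|s))$. The sketch only compares the two matrices and evaluates one quadratic form; it does not compute the natural gradient direction itself.

```python
import math

GAMMA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]  # P[s][a][s'], hypothetical
D = [0.5, 0.5]                                             # initial state distribution
S, A, N, T = range(2), range(2), 4, 1000                   # T truncates the sum in (7)


def policy(w):
    """Tabular softmax policy with pi[s][a] proportional to exp(w[2*s + a])."""
    pi = []
    for s in S:
        z = [math.exp(w[2 * s + a]) for a in A]
        pi.append([zi / sum(z) for zi in z])
    return pi


def occupancy(pi):
    """Discounted occupancy p_gamma(s, a) of equation (7), truncated after T steps."""
    d, discount = list(D), 1.0
    occ = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(T):
        for s in S:
            for a in A:
                occ[s][a] += discount * d[s] * pi[s][a]
        d = [sum(d[s] * pi[s][a] * P[s][a][s2] for s in S for a in A) for s2 in S]
        discount *= GAMMA
    return occ


def fisher_from_curvature(w):
    """Form (13): minus the occupancy-weighted Hessian of the log-policy."""
    pi = policy(w)
    occ = occupancy(pi)
    g = [[0.0] * N for _ in range(N)]
    for s in S:
        for a in A:
            for b in A:
                for c in A:
                    # Hessian entry of log pi(a|s; w) w.r.t. w[2*s + b] and w[2*s + c].
                    curvature = -((pi[s][b] if b == c else 0.0) - pi[s][b] * pi[s][c])
                    g[2 * s + b][2 * s + c] -= occ[s][a] * curvature
    return g


def fisher_from_scores(w):
    """Form (14): occupancy-weighted outer product of the score vectors."""
    pi = policy(w)
    occ = occupancy(pi)
    g = [[0.0] * N for _ in range(N)]
    for s in S:
        for a in A:
            score = [(1.0 if a == b else 0.0) - pi[s][b] for b in A]
            for b in A:
                for c in A:
                    g[2 * s + b][2 * s + c] += occ[s][a] * score[b] * score[c]
    return g


w = [0.2, -0.1, 0.0, 0.5]
G13, G14 = fisher_from_curvature(w), fisher_from_scores(w)
gap = max(abs(G13[i][j] - G14[i][j]) for i in range(N) for j in range(N))
x = [1.0, -2.0, 0.5, 3.0]  # arbitrary test vector for the quadratic form x^T G x
quadratic_form = sum(x[i] * G14[i][j] * x[j] for i in range(N) for j in range(N))
print(gap, quadratic_form)
```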

2.2.3 Expectation Maximization 

An alternative optimization procedure that has been the focus of much research in the
planning and reinforcement learning communities is the EM-algorithm (Dayan and Hinton,
1997; Toussaint et al., 2006, 2011; Kober and Peters, 2009, 2011; Hoffman et al., 2009;
Furmston and Barber, 2009, 2010). The EM-algorithm is a powerful optimization technique
popular in the statistics and machine learning community (see e.g., Dempster et al., 1977;
Little and Rubin, 2002; Neal and Hinton, 1999) that has been successfully applied to a large
number of problems. See Barber (2011) for a general overview of some of the applications
of the algorithm in the machine learning literature. Among the strengths of the algorithm
are its guarantee of increasing the objective function at each iteration, its often simple
update equations and its generalization to highly intractable models through variational
Bayes approximations (Saul et al., 1996).


Given the advantages of the EM-algorithm it is natural to extend the algorithm to the
MDP framework. Several derivations of the EM-algorithm for MDPs exist (Kober and
Peters, 2011; Toussaint et al., 2011). For reference we state the lower-bound upon which
the algorithm is based in the following theorem.

Theorem 2. Suppose we are given a Markov Decision Process with objective (5) and
Markovian trajectory distribution (6). Given any distribution, $q$, over the space of trajectories, $\Xi$,
then the following bound holds,

\[ \log U(w) \geq H_{\text{entropy}}(q(\xi)) + \mathbb{E}_{\xi \sim q(\cdot)}\Big[ \log\big( p(\xi; w) R(\xi) \big) \Big], \quad \forall w \in \mathcal{W}, \tag{15} \]

in which $H_{\text{entropy}}$ denotes the entropy function (Barber, 2011).

Proof. The proof is based on an application of Jensen’s inequality and can be found in
Kober and Peters (2011). □


The distribution, $q$, in Theorem 2 is often referred to as the variational distribution. An
EM-algorithm is obtained through coordinate-wise optimization of (15) with respect to the
variational distribution (the E-step) and the policy parameters (the M-step). In the E-step
the lower-bound is optimized when $q(\xi) \propto p(\xi; w') R(\xi)$, in which $w'$ are the current policy
parameters. In the M-step the lower-bound is optimized with respect to $w$, which, given
$q(\xi) \propto p(\xi; w') R(\xi)$ and the Markovian structure of $\log p(\xi; w)$, is equivalent to optimizing
the function,

\[ \mathcal{Q}(w, w') = \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w') Q(s, a; w') \log \pi(a|s; w), \tag{16} \]

with respect to the first parameter, $w$. The E-step and M-step are iterated in this manner
until the policy parameters converge to a local optimum of the objective function.
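For intuition, the EM iteration has a closed form in one simple special case, which is an assumption of this sketch rather than the setting of the text: a tabular policy in which each $\pi(\cdot|s)$ is a free probability vector. Maximizing (16) subject to normalization then gives $\pi^{\text{new}}(a|s) \propto p_\gamma(s, a; w') Q(s, a; w')$, and since $p_\gamma(s, a; w')$ factorizes into a state occupancy times $\pi(a|s; w')$, the state occupancy cancels whenever it is positive. The Python sketch below runs this update on a hypothetical two-state, two-action MDP with non-negative rewards and records the objective at each iteration.

```python
GAMMA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]  # P[s][a][s'], hypothetical
R = [[0.0, 1.0], [2.0, 0.0]]                               # non-negative rewards, hypothetical
D = [0.5, 0.5]                                             # initial state distribution
S, A = range(2), range(2)


def evaluate(pi, sweeps=1000):
    """Return (V, Q) for the policy pi using repeated Bellman updates."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in S))
                 for a in A) for s in S]
    q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in S) for a in A] for s in S]
    return v, q


def em_update(pi):
    """One EM iteration for a tabular policy: pi_new(a|s) proportional to pi(a|s) Q(s, a)."""
    _, q = evaluate(pi)
    new_pi = []
    for s in S:
        weights = [pi[s][a] * q[s][a] for a in A]
        new_pi.append([x / sum(weights) for x in weights])
    return new_pi


pi = [[0.5, 0.5], [0.5, 0.5]]
objective_trace = []
for _ in range(30):
    v, _ = evaluate(pi)
    objective_trace.append(sum(D[s] * v[s] for s in S))
    pi = em_update(pi)
print(objective_trace[0], objective_trace[-1], pi)
```

The recorded objective values should be non-decreasing, in line with the guarantee of the EM-algorithm noted above; the non-negativity of the rewards is what makes $p(\xi; w') R(\xi)$ a valid unnormalized distribution in the E-step.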


3. The Hessian of Markov Decision Processes 


As noted in Section 1, the Newton method suffers from issues that often make its application
to MDPs unattractive in practice. As a result there has been comparatively little research
into the Newton method in the policy search literature. However, the Newton method has
significant attractive properties, such as affine invariance of the policy parametrization and
a quadratic rate of convergence. It is of interest, therefore, to consider whether one can
construct an efficient Gauss-Newton type method for MDPs, in which the positive aspects
of the Newton method are maintained and the negative aspects are alleviated. To this end,
in this section we provide an analysis of the Hessian of a MDP. This analysis will then be
used in Section 4 to propose Gauss-Newton type methods for MDPs.

In Section 3.1 we provide a novel representation of the Hessian of a MDP, in Section 3.2
we detail the definiteness properties of certain terms in the Hessian and in Section 3.3 we
analyse the behaviour of individual terms of the Hessian in the vicinity of a local optimum.


3.1 The Policy Hessian Theorem 


There is a standard expansion of the Hessian of a MDP in the policy search literature (Baxter
and Bartlett, 2001; Kakade, 2001, 2002) that, as with the gradient, takes a relatively simple
form. This is summarized in the following result.

Theorem 3 (Policy Hessian Theorem). Suppose we are given a Markov Decision Process
with objective (5) and Markovian trajectory distribution (6). For any given parameter
vector, $w \in \mathcal{W}$, the Hessian of (5) takes the form

\[ \mathcal{H}(w) = \mathcal{H}_1(w) + \mathcal{H}_2(w) + \mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w), \tag{17} \]

in which the matrices $\mathcal{H}_1(w)$, $\mathcal{H}_2(w)$ and $\mathcal{H}_{12}(w)$ can be written in the form

\[ \mathcal{H}_1(w) := \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) Q(s, a; w) \nabla_w \log \pi(a|s; w) \nabla_w^\top \log \pi(a|s; w), \tag{18} \]

\[ \mathcal{H}_2(w) := \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) Q(s, a; w) \nabla_w \nabla_w^\top \log \pi(a|s; w), \tag{19} \]

\[ \mathcal{H}_{12}(w) := \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \nabla_w \log \pi(a|s; w) \nabla_w^\top Q(s, a; w). \tag{20} \]

Proof. A derivation for a sample-based estimator of the Hessian can be found in Baxter
and Bartlett (2001). For ease of reference a derivation of (17) is provided in Section A.1 in
the Appendix. □

We remark that $\mathcal{H}_1(w)$ and $\mathcal{H}_2(w)$ are relatively simple to estimate, in the same manner
as estimating the policy gradient. The term $\mathcal{H}_{12}(w)$ is more difficult to estimate since it
contains terms involving the unknown gradient $\nabla_w Q(s, a; w)$ and removing this dependence
would result in a double sum over state-actions.
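The expansion (17) can be checked numerically on a small problem. The Python sketch below assembles $\mathcal{H}_1 + \mathcal{H}_2 + \mathcal{H}_{12} + \mathcal{H}_{12}^\top$ for a tabular softmax policy on a hypothetical two-state, two-action MDP and compares it with a finite-difference Jacobian of the policy gradient (11). Everything here is an illustrative assumption: the MDP, the parametrization, the function names, and the use of central differences to obtain $\nabla_w Q(s, a; w)$, which sidesteps the estimation difficulty noted above only because the model is tiny and fully known.

```python
import math

GAMMA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]  # P[s][a][s'], hypothetical
R = [[0.0, 1.0], [2.0, 0.0]]                               # R[s][a], hypothetical
D = [0.5, 0.5]
S, A, N, T = range(2), range(2), 4, 1000                   # T truncates the infinite sums

def policy(w):
    """Tabular softmax policy with pi[s][a] proportional to exp(w[2*s + a])."""
    pi = []
    for s in S:
        z = [math.exp(w[2 * s + a]) for a in A]
        pi.append([zi / sum(z) for zi in z])
    return pi

def occupancy(pi):
    """Discounted occupancy p_gamma(s, a) of equation (7), truncated after T steps."""
    d, discount = list(D), 1.0
    occ = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(T):
        for s in S:
            for a in A:
                occ[s][a] += discount * d[s] * pi[s][a]
        d = [sum(d[s] * pi[s][a] * P[s][a][s2] for s in S for a in A) for s2 in S]
        discount *= GAMMA
    return occ

def q_values(pi):
    """State-action values from T sweeps of the Bellman equation (3)."""
    v = [0.0, 0.0]
    for _ in range(T):
        v = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in S))
                 for a in A) for s in S]
    return [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in S) for a in A] for s in S]

def score(pi, s, a):
    """Gradient of log pi(a|s; w); only the block of state s is non-zero."""
    g = [0.0] * N
    for b in A:
        g[2 * s + b] = (1.0 if a == b else 0.0) - pi[s][b]
    return g

def curvature(pi, s):
    """Hessian of log pi(a|s; w); for this policy it is the same for every action."""
    h = [[0.0] * N for _ in range(N)]
    for b in A:
        for c in A:
            h[2 * s + b][2 * s + c] = -((pi[s][b] if b == c else 0.0) - pi[s][b] * pi[s][c])
    return h

def policy_gradient(w):
    """Equation (11)."""
    pi = policy(w)
    occ, q = occupancy(pi), q_values(pi)
    return [sum(occ[s][a] * q[s][a] * score(pi, s, a)[i] for s in S for a in A)
            for i in range(N)]

def jacobian(f, w, eps=1e-5):
    """Central differences: jac[j][k] approximates d f(w)[k] / d w[j]."""
    jac = []
    for j in range(N):
        up, down = list(w), list(w)
        up[j] += eps
        down[j] -= eps
        jac.append([(x - y) / (2.0 * eps) for x, y in zip(f(up), f(down))])
    return jac

def q_flat(w):
    """Q(s, a; w) as a flat list indexed by 2*s + a."""
    q = q_values(policy(w))
    return [q[s][a] for s in S for a in A]

def hessian_from_theorem(w):
    """Assemble the right-hand side of (17) from the terms (18), (19) and (20)."""
    pi = policy(w)
    occ, q, dq = occupancy(pi), q_values(pi), jacobian(q_flat, w)
    h = [[0.0] * N for _ in range(N)]
    for s in S:
        curv = curvature(pi, s)
        for a in A:
            sc = score(pi, s, a)
            for i in range(N):
                for j in range(N):
                    h1_h2 = q[s][a] * (sc[i] * sc[j] + curv[i][j])             # (18) + (19)
                    h12 = sc[i] * dq[j][2 * s + a] + sc[j] * dq[i][2 * s + a]  # (20) + transpose
                    h[i][j] += occ[s][a] * (h1_h2 + h12)
    return h

w = [0.2, -0.1, 0.0, 0.5]
H_theorem = hessian_from_theorem(w)
grad_jac = jacobian(policy_gradient, w)   # grad_jac[j][i] = d grad_i / d w_j
H_numeric = [[grad_jac[j][i] for j in range(N)] for i in range(N)]
max_gap = max(abs(H_theorem[i][j] - H_numeric[i][j]) for i in range(N) for j in range(N))
print(max_gap)
```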

Below we will present a novel form for the Hessian of a MDP, with attention given to 
the term 1-Li{w) +'H 2 {w) in ( [l7| ), which will require the following notion of parametrization 
with constant curvature. 


Definition 1. A policy parametrization is said to have constant curvature with respect to
the action space, if for each $(s,a) \in \mathcal{S} \times \mathcal{A}$ the Hessian of the log-policy,
$\nabla_w \nabla_w^\top \log \pi(a|s;w)$, does not depend upon the action, i.e.,

$$\nabla_w \nabla_w^\top \log \pi(a|s;w) = \nabla_w \nabla_w^\top \log \pi(a'|s;w), \quad \forall a, a' \in \mathcal{A}.$$

When a policy parametrization satisfies this property the notation, $\nabla_w \nabla_w^\top \log \pi(s;w)$, is
used to denote $\nabla_w \nabla_w^\top \log \pi(a|s;w)$, for each $a \in \mathcal{A}$.

A common class of policy which satisfies the property of Definition 1 is
$\pi(a|s;w) \propto \exp(w^\top \phi(a,s))$, in which $\phi(a,s)$ is a vector of features that depends on the
state-action pair, $(a,s) \in \mathcal{A} \times \mathcal{S}$. Under this parametrization,

$$\nabla_w \nabla_w^\top \log \pi(a|s;w) = -\mathrm{Cov}_{a' \sim \pi(\cdot|s;w)}\big(\phi(a',s), \phi(a',s)\big),$$

which does not depend on $a \in \mathcal{A}$. In the case when the action space is continuous, the
policy parametrization $\pi(a|s;w,\Sigma) \propto \exp\big(-\tfrac{1}{2}(a - w^\top\phi(s))^\top \Sigma^{-1} (a - w^\top\phi(s))\big)$, in which
$\phi : \mathcal{S} \to \mathbb{R}^n$ is a given feature map, satisfies the properties of Definition 1 with respect to
the mean parameters, $w \in \mathcal{W}$.
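The covariance identity for the softmax parametrization can be checked numerically. The following is a minimal sketch, assuming a single fixed state with four actions and randomly drawn features (the names `log_policy`, `hessian_fd` and the feature matrix `phi` are illustrative and not taken from the paper). It compares a finite-difference Hessian of $\log \pi(a|s;w)$ with $-\mathrm{Cov}(\phi, \phi)$ for every action.

```python
import numpy as np

def log_policy(w, phi):
    """Log-probabilities of a softmax policy in one state; phi has shape (n_actions, n)."""
    z = phi @ w
    z = z - z.max()                      # stabilise the log-sum-exp
    return z - np.log(np.exp(z).sum())

def hessian_fd(f, w, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at w."""
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps**2)
    return H

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))   # hypothetical features phi(a, s) for 4 actions
w = rng.normal(size=3)

pi = np.exp(log_policy(w, phi))
mean = pi @ phi
cov = (phi - mean).T @ (pi[:, None] * (phi - mean))   # covariance of phi under pi(.|s; w)

# One Hessian per action; under constant curvature they should all coincide.
hessians = [hessian_fd(lambda v, a=a: log_policy(v, phi)[a], w) for a in range(4)]
```

Each entry of `hessians` should agree with `-cov` up to finite-difference error, which is the action-independence that Definition 1 asks for.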

We now present a novel decomposition of the Hessian for Markov decision processes. 

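Theorem 4. Suppose we are given a Markov decision process with objective $U(w)$ and a
Markovian trajectory distribution. For any given parameter vector, $w \in \mathcal{W}$, the Hessian of
the objective takes the form

$$\mathcal{H}(w) = \mathcal{A}_1(w) + \mathcal{A}_2(w) + \mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w), \tag{21}$$

where

$$\mathcal{A}_1(w) := \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s,a;w)\, A(s,a;w)\, \nabla_w \log \pi(a|s;w)\, \nabla_w^\top \log \pi(a|s;w),$$

$$\mathcal{A}_2(w) := \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s,a;w)\, A(s,a;w)\, \nabla_w \nabla_w^\top \log \pi(a|s;w).$$

When the curvature of the log-policy is independent of the action, the Hessian takes
the form

$$\mathcal{H}(w) = \mathcal{A}_1(w) + \mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w). \tag{22}$$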
Proof. See Section A.2 in the Appendix. □
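The step from the value-weighted matrices of Theorem 3 to the advantage-weighted matrices of Theorem 4 can be sketched in a few lines. This is only an outline, not the proof in the Appendix, and it assumes the usual factorisation $p_\gamma(s,a;w) = p_\gamma(s;w)\,\pi(a|s;w)$ together with $V(s;w) = \sum_a \pi(a|s;w) Q(s,a;w)$ and $A(s,a;w) = Q(s,a;w) - V(s;w)$.

```latex
% Product rule: \pi \nabla\log\pi \nabla^\top\log\pi + \pi \nabla\nabla^\top\log\pi = \nabla\nabla^\top \pi
\mathcal{H}_1(w) + \mathcal{H}_2(w)
  = \sum_{s} p_\gamma(s;w) \sum_{a} Q(s,a;w)\, \nabla_w \nabla_w^\top \pi(a|s;w).

% Replacing Q(s,a;w) by the action-independent V(s;w) contributes nothing,
% because the policy sums to one over actions:
\sum_{s} p_\gamma(s;w)\, V(s;w)\, \nabla_w \nabla_w^\top \sum_{a} \pi(a|s;w) = 0
\quad\Longrightarrow\quad
\mathcal{H}_1(w) + \mathcal{H}_2(w) = \mathcal{A}_1(w) + \mathcal{A}_2(w).

% Under constant curvature the log-policy Hessian factors out of the action sum:
\mathcal{A}_2(w)
  = \sum_{s} p_\gamma(s;w)\, \nabla_w \nabla_w^\top \log \pi(s;w)
    \underbrace{\sum_{a} \pi(a|s;w)\, A(s,a;w)}_{=\,0}
  = 0 .
```

The last line is the reason the term $\mathcal{A}_2(w)$ is absent from (22).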


We now present an analysis of the terms of the policy Hessian, simplifying the expansion
and demonstrating conditions under which certain terms disappear. The analysis will be
used to motivate our Gauss-Newton methods in Section 4.


3.2 Analysis of the Policy Hessian — Definiteness 

An interesting comparison can be made between the expansions (17) and (21, 22) in terms of
the definiteness properties of the component matrices. As the state-action value function is
non-negative over the entire state-action space, it can be seen that $\mathcal{H}_1(w)$ is positive-definite
for all $w \in \mathcal{W}$. Similarly, it can be shown that under certain common policy parametrizations
$\mathcal{H}_2(w)$ is negative-definite over the entire parameter space. This is summarized in the
following theorem.


Theorem 5. The matrix $\mathcal{H}_2(w)$ is negative-definite for all $w \in \mathcal{W}$ if: 1) the policy is
log-concave with respect to the policy parameters; or 2) the policy parametrization has constant
curvature with respect to the action space.

Proof. See Section A.3 in the Appendix. □


It can be seen, therefore, that when the policy parametrization satisfies the properties
of Theorem 5 the expansion (17) gives $\mathcal{H}(w)$ in terms of a positive-definite term, $\mathcal{H}_1(w)$,
a negative-definite term, $\mathcal{H}_2(w)$, and a remainder term, $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$, which we shall
show, in Section 3.3, becomes negligible around a local optimum when given a sufficiently
rich policy parametrization. In contrast to the state-action value function, the advantage
function takes both positive and negative values over the state-action space. As a result,
the matrices $\mathcal{A}_1(w)$ and $\mathcal{A}_2(w)$ in (21, 22) can be indefinite over parts of the parameter
space.
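These definiteness properties can be inspected on a small example. The sketch below is illustrative only: it assumes a randomly generated 3-state, 2-action MDP with non-negative rewards, a tabular softmax policy, and the unnormalised discounted occupancy $p_\gamma(s,a;w) = \sum_t \gamma^{t} p_t(s,a;w)$ (the paper's exact normalisation may differ). It builds $\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{A}_1$ and $\mathcal{A}_2$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition tensor P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(nS, nA))        # non-negative rewards, so Q >= 0
p1 = np.full(nS, 1.0 / nS)                      # initial state distribution
w = rng.normal(size=(nS, nA))                   # one parameter per (state, action)

pi = np.exp(w - w.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

P_pi = np.einsum("sa,sat->st", pi, P)           # state transition matrix under pi
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * R).sum(axis=1))
Q = R + gamma * P @ V
A = Q - V[:, None]                              # advantage function
d = np.linalg.solve((np.eye(nS) - gamma * P_pi).T, p1)   # discounted state occupancy
p_gamma = d[:, None] * pi

n = nS * nA
H1, H2, A1, A2 = (np.zeros((n, n)) for _ in range(4))
for s in range(nS):
    blk = slice(s * nA, (s + 1) * nA)
    curv = np.zeros((n, n))                     # Hessian of log pi(a|s), same for every a
    curv[blk, blk] = -(np.diag(pi[s]) - np.outer(pi[s], pi[s]))
    for a in range(nA):
        g = np.zeros(n)                         # gradient of log pi(a|s; w)
        g[blk] = -pi[s]
        g[s * nA + a] += 1.0
        outer = np.outer(g, g)
        H1 += p_gamma[s, a] * Q[s, a] * outer
        H2 += p_gamma[s, a] * Q[s, a] * curv
        A1 += p_gamma[s, a] * A[s, a] * outer
        A2 += p_gamma[s, a] * A[s, a] * curv

eig_H1 = np.linalg.eigvalsh(H1)   # expected to be non-negative
eig_H2 = np.linalg.eigvalsh(H2)   # expected to be non-positive
eig_A1 = np.linalg.eigvalsh(A1)   # no sign guarantee
```

Because the tabular softmax is over-parametrized (adding a constant to all parameters of one state leaves the policy unchanged), the eigenvalues of `H1` and `H2` here are only semi-definite, with zeros along those directions. The matrix `A2` vanishes, as expected under constant curvature, and `H1 + H2` matches `A1 + A2`.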


3.3 Analysis in Vicinity of a Local Optimum 

In this section we consider the term $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$, which is both difficult to estimate
and not guaranteed to be negative definite. In particular, we shall consider the conditions
under which these terms vanish at a local optimum. We start by noting that

$$\mathcal{H}_{12}(w) = \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s,a;w)\, \nabla_w \log \pi(a|s;w)\, \nabla_w^\top \Big( R(s,a) + \gamma \sum_{s'} p(s'|a,s)\, V(s';w) \Big)$$

$$\phantom{\mathcal{H}_{12}(w)} = \gamma \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s,a;w)\, \nabla_w \log \pi(a|s;w) \sum_{s'} p(s'|a,s)\, \nabla_w^\top V(s';w). \tag{23}$$

This means that if $\nabla_w V(s';w) = 0$, for all $s' \in \mathcal{S}$, then $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w) = 0$. It is
sufficient, therefore, to require that $\nabla_w|_{w=w^*} V(s;w) = 0$, for all $s \in \mathcal{S}$, at a local optimum
$w^* \in \mathcal{W}$. We therefore consider the situations in which this occurs. We start by introducing
the notion of a value consistent policy class. This property of a policy class captures the idea
that the policy class is rich enough such that changing a parameter to maximally improve
the value in one state does not worsen the value in another state, i.e., when a policy class
is value consistent, there are no trade-offs between improving the value in different states.
Definition 2. A policy parametrization is said to be value consistent w.r.t. a Markov
decision process if whenever

$$e_i^\top \nabla_w V(s;w) \neq 0, \tag{24}$$

for some $s \in \mathcal{S}$, $w \in \mathcal{W}$ and $i \in \{1, \dots, n\}$, then $\forall s' \in \mathcal{S}$ it holds that either

$$\mathrm{sign}\big(e_i^\top \nabla_w V(s';w)\big) = \mathrm{sign}\big(e_i^\top \nabla_w V(s;w)\big), \tag{25}$$

or

$$e_i^\top \nabla_w V(s';w) = 0. \tag{26}$$

Furthermore, for any state, $s' \in \mathcal{S}$, for which (26) holds it also holds that

$$e_i^\top \nabla_w \pi(a|s';w) = 0, \quad \forall a \in \mathcal{A}.$$

The notation $e_i$ is used to denote the standard basis vector of $\mathbb{R}^n$ in which the $i$-th component
is equal to one, and all other components are equal to zero.


Example. To illustrate the concept of a value consistent policy parametrization we now
consider two simple maze navigation MDPs, one with a value consistent policy parametrization,
and one with a policy parametrization that is not value consistent. The two MDPs are
displayed in Figure 1. Walls of the maze are solid lines, while the dotted lines indicate state
boundaries and are passable. The agent starts, with equal probability, in one of the states
marked with an 'S'. The agent receives a positive reward for reaching the goal state, which
is marked with a 'G', and is then reset to one of the start states. All other state-action
pairs return a reward of zero. There are four possible actions (up, down, left, right) in each
state, and the optimal policy is to move, with probability one, in the direction indicated
by the arrow. We consider the policy parametrization, $\pi(a|s;w) \propto \exp(w^\top \phi(s'))$, where $s'$
denotes the successor state of state-action pair $(s,a)$ and $\phi$ is a feature map. We consider
the feature map $\phi : \mathcal{S} \to \{0,1\}^4$ which indicates the presence of a wall on each of the four
state boundaries. Perceptual aliasing (Whitehead, 1992) occurs in both MDPs under this
policy parametrization, with states 2, 3 & 4 aliased in the hallway problem, and states 4,
5 & 6 aliased in McCallum's grid. In the hallway problem all of the aliased states have the
same optimal action, and the values of these states all increase or decrease in unison. Hence,
it can be seen that the policy parametrization is value consistent for the hallway problem.
In McCallum's grid, however, the optimal action for states 4 & 6 is to move upwards, while
in state 5 it is to move downwards. In this example increasing the probability of moving
downwards in state 5 will also increase the probability of moving downwards in states 4 &
6. There is a point, therefore, at which increasing the probability of moving downwards in
state 5 will decrease the value of states 4 & 6. Thus this policy parametrization is not value
consistent for McCallum's grid.


We now show that tabular policies (i.e., policies such that, for each state $s \in \mathcal{S}$, the
conditional distribution $\pi(a|s;w_s)$ is parametrized by a separate parameter vector $w_s \in \mathbb{R}^{n_s}$
for some $n_s \in \mathbb{N}$) are value consistent, regardless of the given Markov decision process.


Theorem 6. Suppose that a given Markov decision process has a tabular policy parametrization,
then the policy parametrization is value consistent.
Figure 1: (a) The hallway problem. Under the feature map, $\phi$, states 2, 3 and 4 map to
the same feature, and the optimal policy is identical on these states. (b) McCallum's grid.
Under the feature map, $\phi$, states 4, 5 and 6 map to the same feature, but now the optimal
policy differs among these states.


Proof. See Section A.4 in the Appendix. □


We now show that under a value consistent policy parametrization the terms $\mathcal{H}_{12}(w)$
and $\mathcal{H}_{12}^\top(w)$ vanish near local optima.


Theorem 7. Suppose that $w^* \in \mathcal{W}$ is a local optimum of the differentiable objective function,
$U(w) = \mathbb{E}_{s \sim p_1(\cdot)}[V(s;w)]$. Suppose that the Markov chain induced by $w^*$ is ergodic.
Suppose that the policy parametrization is value consistent w.r.t. the given Markov decision
process. Then $w^*$ is a stationary point of $V(s;w)$ for all $s \in \mathcal{S}$, and
$\mathcal{H}_{12}(w^*) = \mathcal{H}_{12}^\top(w^*) = 0$.

Proof. See Section A.5 in the Appendix. □

Furthermore, when we have the additional condition that the gradient of the value
function is continuous in $w$ (at $w = w^*$) then $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w) \to 0$ as $w \to w^*$. This
condition will be satisfied if, for example, the policy is continuously differentiable w.r.t. the
policy parameters.

Example (continued). Returning to the MDPs given in Figure 1 we now empirically observe
the behaviour of the term $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$ as the policy approaches a local optimum
of the objective function. Figure 2 gives the magnitude of $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$, in terms of
the spectral norm, in relation to the distance from the local optimum. In correspondence
with the theory, $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w) \to 0$ as $w \to w^*$ in the hallway problem, while this is
not the case in McCallum's grid. This simple example illustrates the fact that if the feature
representation is well-chosen and sufficiently rich the term $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$ vanishes in
the vicinity of a local optimum.


4. Gauss-Newton Methods for Markov Decision Processes 


In this section we propose several Gauss-Newton type methods for MDPs, motivated by
the analysis of Section 3. The algorithms are outlined in Section 4.1 and key performance
analysis is provided in Section 4.2.
Figure 2: Graphical illustration of the logarithm of the spectral norm of $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$
and $\mathcal{A}_1(w)$ in terms of $\|w - w^*\|_2$ for the hallway problem (a) and McCallum's grid (b).
For the given policy parametrization $\mathcal{H}(w) = \mathcal{A}_1(w) + \mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$, so the plot
displays the two components of the Hessian as the policy converges to a local optimum.
As expected, in the hallway problem $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w) \to 0$ as $w \to w^*$, and $\mathcal{A}_1(w)$
dominates. In this example the magnitude of $\mathcal{A}_1(w)$ is roughly six hundred times greater
than that of $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$ when $\|w - w^*\|_2 \approx 0.003$. Conversely, in McCallum's grid
$\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w) \not\to 0$ as $w \to w^*$. In fact, $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$ has larger magnitude than
$\mathcal{A}_1(w)$ at $w^*$ in this example.


4.1 The Gauss-Newton Methods 

The first Gauss-Newton method we propose drops the Hessian terms which are difficult to
estimate, but are expected to be negligible in the vicinity of local optima. Specifically, it
was shown in Section 3.3 that if the policy parametrization is value consistent with a given
MDP, then $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w) \to 0$ as $w$ converges towards a local optimum of the objective
function. Similarly, if the policy parametrization is sufficiently rich, although not necessarily
value consistent, then it is to be expected that $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$ will be negligible in the
vicinity of a local optimum. In such cases $\mathcal{A}_1(w) + \mathcal{A}_2(w)$, as defined in Theorem 4, will
be a good approximation to the Hessian in the vicinity of a local optimum. For this reason,
the first Gauss-Newton method that we propose for MDPs is to precondition the gradient
with $\mathcal{M}(w) = -(\mathcal{A}_1(w) + \mathcal{A}_2(w))^{-1}$ in the preconditioned gradient update, so that the
update is of the form:

Policy search update using the first Gauss-Newton method

$$w^{\text{new}} = w - \alpha \big(\mathcal{A}_1(w) + \mathcal{A}_2(w)\big)^{-1} \nabla_w U(w). \tag{27}$$

When the policy parametrization has constant curvature with respect to the action space
$\mathcal{A}_2(w) = 0$ and it is sufficient to calculate just $\mathcal{A}_1(w)^{-1}$.
The second Gauss-Newton method we propose removes further terms from the Hessian
which are not guaranteed to be negative-definite. As was seen in Section 3.2, when the policy
parametrization satisfies the properties of Theorem 5 then $\mathcal{H}_2(w)$ is negative-definite over
the entire parameter space. Recall that in the preconditioned gradient update it is necessary
that $\mathcal{M}(w)$ is positive-definite (in the Newton method this corresponds to requiring the
Hessian to be negative-definite) to ensure an increase of the objective function. That $\mathcal{H}_2(w)$
is negative-definite over the entire parameter space is therefore a highly desirable property
of a preconditioning matrix, and for this reason the second Gauss-Newton method that we
propose for MDPs is to precondition the gradient with $\mathcal{M}(w) = -\mathcal{H}_2(w)^{-1}$, so that the
update is of the form:

Policy search update using the second Gauss-Newton method

$$w^{\text{new}} = w - \alpha\, \mathcal{H}_2(w)^{-1} \nabla_w U(w). \tag{28}$$


We shall see that the second Gauss-Newton method has important performance guarantees
including: a guaranteed ascent direction; linear convergence to a local optimum under
a step size which does not depend upon unknown quantities; invariance to affine transformations
of the parameter space; and efficient estimation procedures for the preconditioning
matrix. We will also show, in Section 5, that the second Gauss-Newton method is closely
related to both the EM and natural gradient algorithms.

We shall also consider a diagonal form of the approximation for both forms of Gauss-Newton
methods. Denoting the diagonal matrices formed from the diagonal elements of
$\mathcal{A}_1(w) + \mathcal{A}_2(w)$ and $\mathcal{H}_2(w)$ by $\mathcal{D}_{\mathcal{A}_1 + \mathcal{A}_2}(w)$ and $\mathcal{D}_{\mathcal{H}_2}(w)$, respectively, we shall consider
the methods that use $\mathcal{M}(w) = -\mathcal{D}_{\mathcal{A}_1 + \mathcal{A}_2}^{-1}(w)$ and $\mathcal{M}(w) = -\mathcal{D}_{\mathcal{H}_2}^{-1}(w)$ in the preconditioned
gradient update. We call these methods the diagonal first and second Gauss-Newton methods,
respectively. This diagonalization amounts to performing the approximate Newton methods
on each parameter independently, but simultaneously.
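The full and diagonal updates differ only in how the preconditioning matrix is inverted. The following is a minimal sketch with a hypothetical helper `gauss_newton_step`; the matrix `B` and the gradient are random stand-ins rather than quantities estimated from an MDP, and `B` plays the role of $\mathcal{A}_1(w) + \mathcal{A}_2(w)$ or $\mathcal{H}_2(w)$.

```python
import numpy as np

def gauss_newton_step(w, grad, B, alpha=1.0, diagonal=False):
    """One preconditioned ascent step  w - alpha * B^{-1} grad.

    B stands for A_1(w) + A_2(w) (first method) or H_2(w) (second method).
    With diagonal=True only the diagonal of B is used, so every parameter
    is rescaled independently of the others.
    """
    if diagonal:
        return w - alpha * grad / np.diag(B)
    return w - alpha * np.linalg.solve(B, grad)

# Toy stand-ins: a negative-definite B and an arbitrary gradient vector.
rng = np.random.default_rng(0)
L = rng.normal(size=(4, 4))
B = -(L @ L.T + 0.1 * np.eye(4))
grad = rng.normal(size=4)
w = np.zeros(4)

full_dir = gauss_newton_step(w, grad, B) - w
diag_dir = gauss_newton_step(w, grad, B, diagonal=True) - w
```

When `B` is negative-definite both `full_dir` and `diag_dir` have a positive inner product with the gradient, so each is an ascent direction. The diagonal variant avoids the linear solve at the cost of ignoring interactions between parameters.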


4.1.1 Estimation of the Preconditioners and the Gauss-Newton Update Direction

It is possible to extend typical techniques used to estimate the policy gradient to estimate
the preconditioner for the Gauss-Newton method, by including either the Hessian of the
log-policy, the outer product of the derivative of the log-policy, or the respective diagonal terms.
As an example, in Section B.1 of the Appendix we detail the extension of the recurrent state
formulation of gradient evaluation in the average reward framework (Williams, 1992) to the
second Gauss-Newton method. We use this extension in the Tetris experiment that we
consider in Section 6. Given $n_s$ sampled state-action pairs, the complexity of this extension
scales as $O(n_s n^2)$ for the second Gauss-Newton method, while it scales as $O(n_s n)$ for the
diagonal version of the algorithm.

We provide more details of situations in which the inversion of the preconditioning
matrices can be performed more efficiently in Section B.2 of the Appendix. Finally, for the
second Gauss-Newton method the ascent direction can be estimated particularly efficiently,
even for large parameter spaces, using a Hessian-free conjugate-gradient approach, which is
detailed in Section B.3 of the Appendix.
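To give a flavour of what a Hessian-free conjugate-gradient solve looks like, here is a minimal sketch. It is not the procedure of Section B.3: it assumes a linear-Gaussian policy with fixed variance `sigma2`, for which the log-policy Hessian is $-\phi(s)\phi(s)^\top/\sigma^2$, and it uses made-up features `Phi` and weights `q` in place of sampled states and state-action values. The point is that only matrix-vector products with $-\mathcal{H}_2$ are needed, never the matrix itself.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve M x = b for symmetric positive-definite M, given only v -> M v."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - M x for x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if np.sqrt(rs) < tol:
            break
        Mp = matvec(p)
        step = rs / (p @ Mp)
        x += step * p
        r -= step * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 5))        # hypothetical features phi(s_i) of sampled states
q = rng.uniform(0.1, 1.0, size=200)    # hypothetical non-negative value weights
sigma2 = 0.5                           # assumed fixed policy variance
grad = rng.normal(size=5)              # stand-in for the policy gradient

def neg_H2_matvec(v):
    # (-H_2) v = mean_i q_i * phi_i * (phi_i . v) / sigma^2, without forming H_2
    return Phi.T @ (q * (Phi @ v)) / (sigma2 * len(q))

direction = conjugate_gradient(neg_H2_matvec, grad)   # equals -H_2^{-1} grad
```

Each product costs $O(n_s n)$ in this setting, so a handful of conjugate-gradient iterations can be cheaper than forming and inverting the $n \times n$ matrix when $n$ is large.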


4.2 Performance Guarantees and Analysis 
4.2.1 Ascent Directions 

In general the objective is not concave, which means that the Hessian will not be
negative-definite over the entire parameter space. In such cases the Newton method can
actually lower the objective and this is an undesirable aspect of the Newton method. We now
consider ascent directions for the Gauss-Newton methods, and in particular demonstrate
that the proposed second Gauss-Newton method guarantees an ascent direction in typical
settings.


Ascent directions for the first Gauss-Newton method: As mentioned previously,
the matrix $\mathcal{A}_1(w) + \mathcal{A}_2(w)$ will typically be indefinite, and so a straightforward application
of the first Gauss-Newton method will not necessarily result in an increase in the objective
function. There are, however, standard correction techniques that one could consider to
ensure that an increase in the objective function is obtained, such as adding a ridge term to
the preconditioning matrix. A survey of such correction techniques can be found in Boyd
and Vandenberghe (2004).


Ascent directions for the second Gauss-Newton method: It was seen in Theorem 5
that $\mathcal{H}_2(w)$ will be negative-definite over the entire parameter space if either the policy is
log-concave with respect to the policy parameters, or the policy has constant curvature
with respect to the action space. It follows that in such cases an increase of the objective
function will be obtained when using the second Gauss-Newton method with a sufficiently
small step-size. Additionally, the diagonal terms of a negative-definite matrix are negative,
so that $\mathcal{D}_{\mathcal{H}_2}(w)$ is negative-definite whenever $\mathcal{H}_2(w)$ is negative-definite, and thus
similar performance guarantees exist for the diagonal version of the second Gauss-Newton
algorithm.

To motivate this result we now briefly consider some widely used policies that are either
log-concave or blockwise log-concave. Firstly, consider the Gibbs policy,
$\pi(a|s;w) \propto \exp(w^\top \phi(a,s))$, in which $\phi(a,s) \in \mathbb{R}^n$ is a feature vector. This policy is widely used in
discrete systems and is log-concave in $w$, which can be seen from the fact that $\log \pi(a|s;w)$
is the sum of a linear term and a negative log-sum-exp term, both of which are concave
(Boyd and Vandenberghe, 2004). In systems with a continuous state-action space a common
choice of controller is $\pi(a|s;K,\Sigma) = \mathcal{N}(a|K\phi(s), \Sigma)$, in which $\phi(s) \in \mathbb{R}^n$ is a feature
vector. This controller is not jointly log-concave in $K$ and $\Sigma$, but it is blockwise log-concave
in $K$ and $\Sigma^{-1}$. In terms of $K$ the log-policy is quadratic and the coefficient matrix of the
quadratic term is negative-definite. In terms of $\Sigma^{-1}$ the log-policy consists of a linear term
and a log-determinant term, both of which are concave.


4.2.2 Affine Invariance 

An undesirable aspect of steepest gradient ascent is that its performance is dependent on
the choice of basis used to represent the parameter space. An important and desirable
property of the Newton method is that it is invariant to non-singular affine transformations
of the parameter space (Boyd and Vandenberghe, 2004). This means that given a non-singular
affine mapping, $T \in \mathbb{R}^{n \times n}$, the Newton update of the objective $\tilde{U}(w) = U(Tw)$ is
related to the Newton update of the original objective through the same affine mapping,
i.e., $v + \Delta v_{nt} = T(w + \Delta w_{nt})$, in which $v = Tw$ and $\Delta v_{nt}$ and $\Delta w_{nt}$ denote the respective
Newton steps. A method is said to be scale invariant if it is invariant to non-singular
rescalings of the parameter space. In this case the mapping $T \in \mathbb{R}^{n \times n}$ is given by a
non-singular diagonal matrix. The proposed approximate Newton methods have various
invariance properties, and these properties are summarized in the following theorem.


Theorem 8. The first and second Gauss-Newton methods are invariant to (non-singular)
affine transformations of the parameter space. The diagonal versions of these algorithms
are invariant to (non-singular) rescalings of the parameter space.

Proof. See Section A.6 in the Appendix. □
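The invariance relation $v + \Delta v_{nt} = T(w + \Delta w_{nt})$ can be illustrated numerically for the exact Newton step. The sketch below uses a toy concave objective (a negative sum of softplus terms with a quadratic penalty) that is an assumption made for illustration and is unrelated to any MDP. The same chain-rule argument goes through for any preconditioning matrix $B$ that transforms as $T^\top B\, T$ under the reparametrization, which is the property that Theorem 8 establishes for the Gauss-Newton matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))

# Toy concave objective U(v) = -sum_i log(1 + exp(a_i . v)) - 0.5 * ||v||^2
def grad_U(v):
    s = 1.0 / (1.0 + np.exp(-(A @ v)))
    return -A.T @ s - v

def hess_U(v):
    s = 1.0 / (1.0 + np.exp(-(A @ v)))
    return -(A.T * (s * (1 - s))) @ A - np.eye(3)

T = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])      # non-singular (determinant 5)
w = rng.normal(size=3)
v = T @ w

# Newton step for the original objective U at v.
dv_newton = -np.linalg.solve(hess_U(v), grad_U(v))

# Newton step for the transformed objective U~(w) = U(T w) at w (chain rule).
g_t = T.T @ grad_U(v)
H_t = T.T @ hess_U(v) @ T
dw_newton = -np.linalg.solve(H_t, g_t)

# Steepest ascent steps, for comparison.
dv_grad = grad_U(v)
dw_grad = g_t

newton_invariant = np.allclose(v + dv_newton, T @ (w + dw_newton))
steepest_invariant = np.allclose(v + dv_grad, T @ (w + dw_grad))
```

The Newton iterates in the two coordinate systems map onto each other through `T`, whereas the steepest ascent iterates do not.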


4.2.3 Convergence Analysis 


We now provide a local convergence analysis of the Gauss-Newton framework. We shall
focus on the full Gauss-Newton methods, with the analysis of the diagonal Gauss-Newton
method following similarly. Additionally, we shall focus on the case in which a constant
step size is considered throughout, which is denoted by $\alpha \in \mathbb{R}^+$. We say that an algorithm
converges linearly to a limit $L$ at a rate $r \in (0,1)$ if

$$\lim_{k \to \infty} \frac{|U(w_{k+1}) - L|}{|U(w_k) - L|} = r.$$

If $r = 0$ then the algorithm converges super-linearly. We denote the parameter update
function of the first and second Gauss-Newton methods by $G_1$ and $G_2$, respectively, so
that $G_1(w) = w - \alpha(\mathcal{A}_1(w) + \mathcal{A}_2(w))^{-1} \nabla U(w)$ and $G_2(w) = w - \alpha \mathcal{H}_2(w)^{-1} \nabla U(w)$.
Given a matrix, $A \in L(\mathbb{R}^n)$, we denote the spectral radius of $A$ by $\rho(A) = \max_i |\lambda_i|$, where
$\{\lambda_i\}_{i=1}^n$ are the eigenvalues of $A$. Throughout this section we shall use $\nabla G(w^*)$ to denote
$\nabla_w|_{w=w^*} G(w)$.

Theorem 9 (Convergence analysis for the first Gauss-Newton method). Suppose that
$w^* \in \mathcal{W}$ is such that $\nabla_w|_{w=w^*} U(w) = 0$ and $\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*)$ is invertible, then $G_1$ is Fréchet
differentiable at $w^*$ and $\nabla G_1(w^*)$ takes the form

$$\nabla G_1(w^*) = I - \alpha \big(\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*)\big)^{-1} \mathcal{H}(w^*). \tag{29}$$

If $\mathcal{H}(w^*)$ and $\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*)$ are negative-definite, and the step size is in the range

$$\alpha \in \Big(0,\; 2 / \rho\big((\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*))^{-1} \mathcal{H}(w^*)\big)\Big), \tag{30}$$

then $w^*$ is a point of attraction of the first Gauss-Newton method, the convergence is at
least linear and the rate is given by $\rho(\nabla G_1(w^*)) < 1$. When the policy parametrization is
value consistent with respect to the given Markov decision process, then (29) simplifies to

$$\nabla G_1(w^*) = (1 - \alpha) I, \tag{31}$$

and whenever $\alpha \in (0, 2)$ then $w^*$ is a point of attraction of the first Gauss-Newton method,
and the convergence to $w^*$ is linear if $\alpha \neq 1$ with a rate given by $\rho(\nabla G_1(w^*)) < 1$, and
convergence is super-linear when $\alpha = 1$.
Proof. See Section A.7 in the Appendix. □
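The role of the spectral radius in (29) and (30) can be seen on a quadratic stand-in objective, for which the update map is exactly linear. The sketch below is an illustration under assumptions, not part of the paper's analysis: `H` and `B` are random negative-definite matrices standing in for $\mathcal{H}(w^*)$ and the preconditioning matrix, and the error is measured in the norm induced by $-B$, in which the per-step contraction is bounded by the spectral radius.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
L1 = rng.normal(size=(n, n))
H = -(L1 @ L1.T + np.eye(n))        # stand-in Hessian at w*, negative-definite
L2 = rng.normal(size=(n, n))
B = -(L2 @ L2.T + np.eye(n))        # stand-in preconditioning matrix, negative-definite

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# A step size inside the admissible range (0, 2 / rho(B^{-1} H)).
alpha = 1.0 / spectral_radius(np.linalg.solve(B, H))
J = np.eye(n) - alpha * np.linalg.solve(B, H)   # Jacobian of the update map, cf. (29)
rate = spectral_radius(J)

def energy_norm(e):
    """Norm induced by the positive-definite matrix -B."""
    return np.sqrt(e @ (-B) @ e)

# For U(w) = 0.5 (w - w*)^T H (w - w*), the error e = w - w* evolves as e <- J e.
e = rng.normal(size=n)
ratios = []
for _ in range(30):
    e_next = J @ e
    ratios.append(energy_norm(e_next) / energy_norm(e))
    e = e_next
```

Here `rate` is below one and every entry of `ratios` is bounded by it. Setting `B = H` and `alpha = 1` makes `J` the zero matrix, which corresponds to the super-linear case of the exact Newton method.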


Additionally we make the following remarks for the case when the policy parametrization
is not value consistent with respect to the given Markov decision process. For simplicity,
we shall consider the case in which $\alpha = 1$. In this case $\nabla G_1(w^*)$ takes the form

$$\nabla G_1(w^*) = -\big(\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*)\big)^{-1} \big(\mathcal{H}_{12}(w^*) + \mathcal{H}_{12}^\top(w^*)\big).$$

From the analysis in Section 3.3 we expect that when the policy parametrization is rich, but
not value consistent with respect to the given Markov decision process,
$\rho\big((\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*))^{-1} (\mathcal{H}_{12}(w^*) + \mathcal{H}_{12}^\top(w^*))\big)$ will generally be small. In this case the first
Gauss-Newton method will converge linearly, and the rate of convergence will be close to zero.

Theorem 10 (Convergence analysis for the second Gauss-Newton method). Suppose that
$w^* \in \mathcal{W}$ is such that $\nabla_w|_{w=w^*} U(w) = 0$ and $\mathcal{H}_2(w^*)$ is invertible, then $G_2$ is Fréchet
differentiable at $w^*$ and $\nabla G_2(w^*)$ takes the form

$$\nabla G_2(w^*) = I - \alpha\, \mathcal{H}_2^{-1}(w^*) \mathcal{H}(w^*). \tag{32}$$

If $\mathcal{H}(w^*)$ is negative-definite and the step size is in the range

$$\alpha \in \Big(0,\; 2 / \rho\big(\mathcal{H}_2(w^*)^{-1} \mathcal{H}(w^*)\big)\Big), \tag{33}$$

then $w^*$ is a point of attraction of the second Gauss-Newton method, convergence to $w^*$ is
at least linear and the rate is given by $\rho(\nabla G_2(w^*)) < 1$. Furthermore, $\alpha \in (0, 2)$ implies
condition (33). When the policy parametrization is value consistent with respect to the given
Markov decision process, then (32) simplifies to

$$\nabla G_2(w^*) = I - \alpha\, \mathcal{H}_2^{-1}(w^*) \mathcal{A}_1(w^*). \tag{34}$$

Proof. See Section A.7 in the Appendix. □


The conditions of Theorem 10 look analogous to those of Theorem 9, but they differ in
important ways: it is not necessary to assume that the preconditioning matrix is
negative-definite, and the step-size range in (30) will not be known in practice, whereas the
condition $\alpha \in (0, 2)$ in Theorem 10 is more practical, i.e., for the second Gauss-Newton
method convergence is guaranteed for a constant step size which is easily selected and does
not depend upon unknown quantities.

It will be seen in Section 5.2 that the second Gauss-Newton method has a close relationship
to the EM-algorithm. For this reason we postpone additional discussion about the
rate of convergence of the second Gauss-Newton method until then.


5. Relation to Existing Policy Search Methods 

In this section we detail the relationship between the second Gauss-Newton method and
existing policy search methods: in Section 5.1 we detail the relationship with natural gradient
ascent and in Section 5.2 we detail the relationship with the EM-algorithm.
5.1 Natural Gradient Ascent and the Second Gauss-Newton Method


Comparing the form of the Fisher information matrix given in (13) with $\mathcal{H}_2$ in (19) it can
be seen that there is a close relationship between natural gradient ascent and the second
Gauss-Newton method: in $\mathcal{H}_2$ there is an additional weighting of the integrand from the
state-action value function. Hence, $\mathcal{H}_2$ incorporates information about the reward structure
of the objective function that is not present in the Fisher information matrix.

We now consider how this additional weighting affects the search direction for natural
gradient ascent and the Gauss-Newton approach. Given a norm on the parameter space,
$\|\cdot\|$, the steepest ascent direction at $w \in \mathcal{W}$ with respect to that norm is given by

$$\hat{p} = \operatorname{argmax}_{\{p \,:\, \|p\| = 1\}} \lim_{\alpha \to 0} \frac{U(w + \alpha p) - U(w)}{\alpha}.$$

Natural gradient ascent is obtained by considering the (local) norm $\|\cdot\|_{G(w)}$ given by

$$\|w - w'\|^2_{G(w)} := (w - w')^\top G(w) (w - w'),$$

with $G(w)$ as in (14). The natural gradient method allows less movement in the directions
that have high norm which, as can be seen from the form of (14), are those directions that
induce large changes to the policy over the parts of the state-action space that are likely
to be visited under the current policy parameters. More movement is allowed in directions
that either induce a small change in the policy, or induce large changes to the policy, but
only in parts of the state-action space that are unlikely to be visited under the current
policy parameters. In a similar manner the second Gauss-Newton method can be obtained
by considering the (local) norm $\|\cdot\|_{\mathcal{H}_2(w)}$,

$$\|w - w'\|^2_{\mathcal{H}_2(w)} := -(w - w')^\top \mathcal{H}_2(w) (w - w'),$$

so that each term in (13) is additionally weighted by the state-action value function,
$Q(s,a;w)$. Thus, the directions which have high norm are those in which the policy is
rapidly changing in state-action pairs that are not only likely to be visited under the current
policy, but also have high value. Thus the second Gauss-Newton method updates the
parameters more carefully if the behaviour in high value states is affected. Conversely,
directions which induce a change only in state-action pairs of low value have low norm, and
larger increments can be made in those directions.
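For completeness, the steepest ascent direction under a quadratic norm has a closed form, which makes the link between the choice of norm and the preconditioner explicit. The derivation below is a standard Lagrangian argument for a generic symmetric positive-definite matrix $M$ (the symbol $M$ and the multiplier $\lambda$ are introduced here for illustration).

```latex
% The directional derivative of U at w along p is \nabla_w U(w)^\top p, so we solve
\max_{p} \; \nabla_w U(w)^\top p
\quad \text{subject to} \quad p^\top M p = 1 .

% Stationarity of the Lagrangian \nabla_w U(w)^\top p - \tfrac{\lambda}{2}\,(p^\top M p - 1):
\nabla_w U(w) - \lambda M p = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{M^{-1} \nabla_w U(w)}
               {\sqrt{\nabla_w U(w)^\top M^{-1} \nabla_w U(w)}} .

% Two choices of M:
M = G(w) \;\Rightarrow\; \hat{p} \propto G(w)^{-1} \nabla_w U(w)
\quad \text{(natural gradient ascent)},
\qquad
M = -\mathcal{H}_2(w) \;\Rightarrow\; \hat{p} \propto -\mathcal{H}_2(w)^{-1} \nabla_w U(w)
\quad \text{(second Gauss-Newton method)} .
```

The second choice requires $-\mathcal{H}_2(w)$ to be positive-definite, which is the setting covered by Theorem 5.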


5.2 Expectation Maximization and the Second Gauss-Newton Method 

It has previously been noted (Kober and Peters, 2011) that the parameter update of steepest
gradient ascent and the EM-algorithm can be related through the function $\mathcal{Q}$ defined in (16).
In particular, the gradient of the objective evaluated at $w_k$ can be written in terms of $\mathcal{Q}$ as follows,

$$\nabla_w|_{w=w_k} U(w) = \nabla_w|_{w=w_k} \mathcal{Q}(w, w_k),$$

while the parameter update of the EM-algorithm is given by

$$w_{k+1} = \operatorname{argmax}_{w \in \mathcal{W}} \mathcal{Q}(w, w_k).$$
In other words, steepest gradient ascent moves in the direction that most rapidly increases
$\mathcal{Q}$ with respect to the first variable, while the EM-algorithm maximizes $\mathcal{Q}$ with respect
to the first variable. While this relationship is true, it is also quite a negative result. It
states that in situations in which it is not possible to explicitly maximize $\mathcal{Q}$ with respect to
its first variable, then the alternative, in terms of the EM-algorithm, is a generalized
EM-algorithm, which is equivalent to steepest gradient ascent. Given that the EM-algorithm
is typically used to overcome the negative aspects of steepest gradient ascent, this is an
undesirable alternative. It is possible to find the optimum of (16) numerically, but this
is also undesirable as it results in a double-loop algorithm that could be computationally
expensive. Finally, this result provides no insight into the behaviour of the EM-algorithm,
in terms of the direction of its parameter update, when the maximization over $w$ in (16)
can be performed explicitly.

We now demonstrate that the step-direction of the EM-algorithm has an underlying
relationship with the second of our proposed Gauss-Newton methods. In particular, we show
that under suitable regularity conditions the direction of the EM-update, i.e., $w_{k+1} - w_k$,
is the same, up to first order, as the direction of the second Gauss-Newton method that
uses $\mathcal{H}_2(w)$ in place of $\mathcal{H}(w)$.


Theorem 11. Suppose we are given a Markov decision process with objective $U(w)$ and a
Markovian trajectory distribution. Consider the parameter update (M-step) of Expectation
Maximization at the $k$-th iteration of the algorithm, i.e.,

$$w_{k+1} = \operatorname{argmax}_{w \in \mathcal{W}} \mathcal{Q}(w, w_k).$$

Provided that $\mathcal{Q}(w, w_k)$ is twice continuously differentiable in the first parameter we have
that

$$w_{k+1} - w_k = -\mathcal{H}_2^{-1}(w_k)\, \nabla_w|_{w=w_k} U(w) + O\big(\|w_{k+1} - w_k\|^2\big). \tag{35}$$

Additionally, in the case where the log-policy is quadratic the relation to the second
Gauss-Newton method is exact, i.e., the second term on the r.h.s. of (35) is zero.

Proof. See Section A.8 in the Appendix. □
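The mechanism behind (35) can be outlined with a first-order Taylor expansion. The sketch below is not the proof of Section A.8. It assumes that, up to an additive constant, the function in (16) has the form $\mathcal{Q}(w, w_k) = \sum_{(s,a)} p_\gamma(s,a;w_k)\, Q(s,a;w_k) \log \pi(a|s;w)$, so that its Hessian in the first argument at $w = w_k$ coincides with $\mathcal{H}_2(w_k)$ in (19), and that $w_{k+1}$ is an interior maximizer.

```latex
% First-order optimality of the M-step, expanded around w_k:
0 = \nabla_w \mathcal{Q}(w, w_k)\big|_{w = w_{k+1}}
  = \nabla_w \mathcal{Q}(w, w_k)\big|_{w = w_k}
    + \nabla_w \nabla_w^\top \mathcal{Q}(w, w_k)\big|_{w = w_k} (w_{k+1} - w_k)
    + O\big(\|w_{k+1} - w_k\|^2\big).

% Substituting the gradient relation and the assumed Hessian identity:
0 = \nabla_w\big|_{w = w_k} U(w) + \mathcal{H}_2(w_k)\,(w_{k+1} - w_k)
    + O\big(\|w_{k+1} - w_k\|^2\big)

\quad\Longrightarrow\quad
w_{k+1} - w_k = -\mathcal{H}_2^{-1}(w_k)\, \nabla_w\big|_{w = w_k} U(w)
    + O\big(\|w_{k+1} - w_k\|^2\big).
```

When the log-policy is quadratic in $w$ the function $\mathcal{Q}(\cdot, w_k)$ is itself quadratic, the expansion terminates after the second-order term, and the remainder vanishes.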


Given a sequence of parameter vectors, $\{w_k\}_{k \in \mathbb{N}}$, generated through an application of
the EM-algorithm, then $\lim_{k \to \infty} \|w_{k+1} - w_k\| = 0$. This means that the rate of convergence
of the EM-algorithm will be the same as that of the second Gauss-Newton method when
considering a constant step size of one. We formalize this intuition and provide the
convergence properties of the EM-algorithm when applied to Markov decision processes in the
following theorem. This is, to our knowledge, the first formal derivation of the convergence
properties for this application of the EM-algorithm.


Theorem 12. Suppose that the sequence, $\{w_k\}_{k \in \mathbb{N}}$, is generated by an application of the
EM-algorithm, where the sequence converges to $w^*$. Denoting the update operation of the
EM-algorithm by $G_{EM}$, so that $w_{k+1} = G_{EM}(w_k)$, then

$$\nabla G_{EM}(w^*) = I - \mathcal{H}_2^{-1}(w^*) \mathcal{H}(w^*).$$

When the policy parametrization is value consistent with respect to the given Markov decision
process this simplifies to $\nabla G_{EM}(w^*) = I - \mathcal{H}_2(w^*)^{-1} \mathcal{A}_1(w^*)$. When the Hessian,
$\mathcal{H}(w^*)$, is negative-definite then $\rho(\nabla G_{EM}(w^*)) < 1$ and $w^*$ is a local point of attraction
for the EM-algorithm.

Proof. See Section A.9 in the Appendix. □


6. Experiments 

In this section we provide an empirical evaluation of the Gauss-Newton methods on a varied 
set of challenging domains. 


6.1 Affine Invariance Experiment

In the first experiment we give an empirical illustration that the full Gauss-Newton methods
are invariant to affine transformations of the parameter space. Additionally, we illustrate
that the diagonal Gauss-Newton methods are invariant to (non-zero) rescalings of the
dimensions of the parameter space. We consider the simple two state example of Kakade
(2002). In this example problem the policy has only two parameters, so that it is possible
to plot the trace of the policy during training. The policy is trained using steepest gradient
ascent, the full Gauss-Newton methods and the diagonal Gauss-Newton methods. We train
the policy in both the original and linearly transformed parameter space. The policy traces
of the various algorithms are given in Figure 3. As expected, steepest gradient ascent is
affected by both forms of transformation, while the diagonal Gauss-Newton methods are
invariant to diagonal rescalings of the parameter space, and the full Gauss-Newton methods
are invariant to both forms of transformation.


6.2 Cart-Pole Swing-Up Benchmark Experiment 

We also implemented the Gauss-Newton methods on the standard simulated cart-pole
benchmark problem. This problem involves a pole attached at a pivot to a cart, and by
applying force to the cart the pole must be swung to the vertical position and balanced.
The problem is under-actuated in the sense that insufficient power is available to drive the
pole directly to the vertical position, hence the problem captures the notion of trading off
immediate reward for long term gain. In this episodic experiment we used an actor-critic
architecture (Konda and Tsitsiklis, 1999) using compatible features to fit the Q-function.
We used the same simulator as Lagoudakis and Parr (2003), except here we allow continuous
actions and choose a continuous reward signal. The state space is two dimensional,
$s = (\theta, \dot{\theta})$, representing the angle ($\theta = 0$ when the pole is pointing vertically upwards) and
angular velocity of the pole. The action space is $\mathcal{A} = [-50, 50]$, representing the horizontal
force in Newtons applied to the cart (i.e., any actions of greater magnitude returned by the
controller are clipped at $\pm 50$). Uniform noise in $[-10, 10]$ is added to each action (before
clipping). The system dynamics are $\theta_{t+1} = \theta_t + \Delta t\, \dot{\theta}_t$, $\dot{\theta}_{t+1} = \dot{\theta}_t + \Delta t\, \ddot{\theta}_t$, where

$$\ddot{\theta} = \frac{g \sin(\theta) - a m \ell\, \dot{\theta}^2 \sin(2\theta)/2 - a \cos(\theta)\, u}{4\ell/3 - a m \ell \cos^2(\theta)},$$

where $g = 9.8\,\mathrm{m/s^2}$ is the acceleration due to gravity, $m = 2\,\mathrm{kg}$ is the mass of the pole,
$M = 8\,\mathrm{kg}$ is the mass of the cart, $\ell = 0.5\,\mathrm{m}$ is the length of the pole and $a = 1/(m + M)$.
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■ ■ 1 Steepest: Original ■ ■ ■ 2"“^ GN: Original * * * 1®^ DGN: Transform 

*** Steepest: Transform *** 2““^GN:Transform ■■■ 2'"^DGN:Original 

■ ■■ 1®* GN: Original p* DGN : Original *** 2'"^ DGN : Transform 

* * * 1®* GN : Transform 

Figure 3: Results from (a) the scale invariance experiment and (b) the affine invariance 
experiment. The plots show the trace of the policy through the parameter space during 
the course of training. The plots give the trace of the policy when trained in the original 
parameter space (square markers), and when trained in the transformed parameter space 
(star markers). For comparison, the policy traces in the transformed parameter space have 
been mapped back to the original space. The plots show the trace of the policy when 
the policy is trained with steepest gradient ascent (green), the first Gauss-Newton method 
(red), the second Gauss-Newton method (blue), the first diagonal Gauss-Newton method 
(purple) and the second diagonal Gauss-Newton method (black). 

We choose $\Delta_t = 0.1$s. Rewards $R(s, a) = \ldots$, the discount factor is $\gamma = 0.99$, the
horizon is $H = 100$, and the pole begins in the downwards position, $s_0 = (\pi, 0)$.
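The simulator update can be sketched in a few lines of code. This is a minimal sketch, assuming the angular acceleration takes the form used by Lagoudakis and Parr (2003) with the constants quoted in the text; the names `angular_acceleration` and `step` are illustrative and are not taken from the original implementation. The reward is deliberately left out.

```python
import math
import random

G = 9.8        # acceleration due to gravity (m/s^2)
M_POLE = 2.0   # mass of the pole, m (kg)
M_CART = 8.0   # mass of the cart, M (kg)
LENGTH = 0.5   # length of the pole (m)
ALPHA = 1.0 / (M_POLE + M_CART)
DT = 0.1       # Euler time step (s)


def angular_acceleration(theta, theta_dot, u):
    """Angular acceleration of the pole for a horizontal force u (Newtons)."""
    num = (G * math.sin(theta)
           - ALPHA * M_POLE * LENGTH * theta_dot ** 2 * math.sin(2.0 * theta) / 2.0
           - ALPHA * math.cos(theta) * u)
    den = 4.0 * LENGTH / 3.0 - ALPHA * M_POLE * LENGTH * math.cos(theta) ** 2
    return num / den


def step(theta, theta_dot, action, rng=None):
    """One Euler step. Uniform noise in [-10, 10] is added before clipping at +/-50.

    Passing rng=None disables the action noise (useful for checking the dynamics).
    """
    noise = rng.uniform(-10.0, 10.0) if rng is not None else 0.0
    u = max(-50.0, min(50.0, action + noise))
    theta_ddot = angular_acceleration(theta, theta_dot, u)
    return theta + DT * theta_dot, theta_dot + DT * theta_ddot
```

A noisy rollout from the downward position would start from `(math.pi, 0.0)` and call `step(theta, theta_dot, a, random.Random(0))` repeatedly for `H = 100` steps.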

The controller is a Gaussian,

$$\pi(a|s; w) = \mathcal{N}(a \,|\, \phi(s)^\top w, \sigma^2),$$

with radial basis features, $\phi_i(s) = \exp(-\frac{1}{2}(c_i - s)^\top \Lambda (c_i - s))$. For each separate experiment
the 100 centers $c_i$ were drawn uniformly at random from $[-\pi, \pi] \times [-d\pi, d\pi]$, the bandwidth
was fixed at $\Lambda = \ldots$, and the policy noise $\sigma$ was fixed at 2 (these parameters were
found by an informal search). Controller weights $w_0$ were initialized randomly for each
experiment.
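A controller of this kind might be sketched as follows. This is an illustration only: it assumes a diagonal bandwidth matrix and a factor of 1/2 in the exponent, and the sampling ranges and bandwidth values passed by the caller are placeholders rather than the settings used in the experiment. All function names are hypothetical.

```python
import math
import random


def draw_centers(n, lows, highs, rng):
    """Draw n RBF centers uniformly at random from a box (placeholder ranges)."""
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in range(n)]


def rbf_features(s, centers, bandwidth):
    """phi_i(s) = exp(-0.5 * (c_i - s)^T Lambda (c_i - s)), Lambda = diag(bandwidth)."""
    feats = []
    for c in centers:
        quad = sum(lam * (ci - si) ** 2 for lam, ci, si in zip(bandwidth, c, s))
        feats.append(math.exp(-0.5 * quad))
    return feats


def policy_mean(s, w, centers, bandwidth):
    """Mean of the Gaussian controller, phi(s)^T w."""
    return sum(wi * fi for wi, fi in zip(w, rbf_features(s, centers, bandwidth)))


def sample_action(s, w, centers, bandwidth, sigma, rng):
    """Sample a ~ N(phi(s)^T w, sigma^2)."""
    return rng.gauss(policy_mean(s, w, centers, bandwidth), sigma)
```

With `sigma = 2.0` and 100 centers this mirrors the structure described in the text; the weights `w` would be the policy parameters updated during training.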

The policy was updated after every 10 trajectories, i.e., each iteration corresponds to
10 episodes of experience. Of these, 5 trajectories were used to estimate the policy gradient
and the preconditioning matrix, while the remaining 5 trajectories were used to learn an
approximation $\hat{Q}(s, a; w) = \psi(s, a; w)^\top \theta$ to the Q-function $Q(s, a; w)$ using the compatible
features (Kakade, 2002),

$$\psi(s, a; w) = \nabla_w \log \pi(a|s; w) = \frac{1}{\sigma^2} (a - \phi(s)^\top w) \phi(s).$$

The weight vector $\theta$ was learnt using least-squares linear regression. For each $(s_t, a_t)$ in an
experienced trajectory the targets were provided by Monte-Carlo roll-out estimates

$$Q(s_t, a_t; w) \approx \sum_{\tau=1}^{H} \gamma^{\tau-1} R(s_{t+\tau-1}, a_{t+\tau-1}).$$

Note that each trajectory was therefore simulated for a length $2H$, rather than $H$, in order
to gather the target data. A regularization parameter was validated on a held out subset
of the data.
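The roll-out targets explain why trajectories of length 2H are needed: the target for each of the first H time steps sums H discounted rewards. A minimal sketch, assuming the rewards of one trajectory are available as a plain list (the function name `rollout_targets` is hypothetical):

```python
def rollout_targets(rewards, horizon, gamma):
    """Monte-Carlo Q-targets for t = 0, ..., horizon - 1.

    target[t] = sum_{tau=1}^{horizon} gamma**(tau - 1) * rewards[t + tau - 1],
    which touches rewards up to index 2 * horizon - 2, hence the 2H-long roll-out.
    """
    if len(rewards) < 2 * horizon:
        raise ValueError("need a reward sequence of length at least 2 * horizon")
    targets = []
    for t in range(horizon):
        total = 0.0
        for tau in range(1, horizon + 1):
            total += gamma ** (tau - 1) * rewards[t + tau - 1]
        targets.append(total)
    return targets
```

These targets would then be regressed onto the compatible features by regularized least squares to obtain the weight vector.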


We compared 5 algorithms: steepest ascent, ‘Steepest’, (10); the natural gradients
algorithm, ‘Natural’, (12), with preconditioner $\mathcal{M}(w) = G(w)^{-1}$; compatible natural gradients,
‘Comp Natural’, in which the policy parameter is updated in the direction $\theta$ of the
Q-function weight vector (Kakade, 2002); the first Gauss-Newton method, ‘First G-N’, with
the preconditioner given in (27); and the second Gauss-Newton method, ‘Second G-N’, (28), using
$\mathcal{M}(w) = -\mathcal{H}_2(w)^{-1}$. To precondition the gradients we solved the required linear systems
using steepest descent, with the gradient as a warm start, for a maximum of 250 iterations,
rather than direct inversion. This was found to be more stable in this experiment than
inversion of the preconditioning matrices for all methods, since the Fisher information matrix
and the (approximate) Hessians can be poorly conditioned: for example, when the policy
trajectories are supported entirely on a region of space in which some features are never
active, none of the gradient, the Hessian and the Fisher information matrix will have any components
corresponding to those feature dimensions.
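The iterative solve can be illustrated with a small sketch. It assumes the system matrix `A` is symmetric positive semi-definite (for instance the Fisher matrix, or the negated approximate Hessian) and uses exact line-search steepest descent on the quadratic; the helper names are hypothetical and the stopping tolerance is an arbitrary choice.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def precondition(A, g, max_iters=250, tol=1e-12):
    """Approximately solve A d = g by steepest descent, warm-started at d = g."""
    d = list(g)  # warm start: the gradient itself
    for _ in range(max_iters):
        r = [gi - ai for gi, ai in zip(g, matvec(A, d))]  # residual g - A d
        rr = dot(r, r)
        rAr = dot(r, matvec(A, r))
        if rr <= tol or rAr <= 0.0:
            break  # converged, or no curvature along the residual
        step = rr / rAr  # exact minimizer along r for the quadratic
        d = [di + step * ri for di, ri in zip(d, r)]
    return d
```

Note how a feature dimension that is never active (a zero row and column in `A` and a zero entry in `g`) simply stays at zero, whereas direct inversion would fail on the singular matrix.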

We used a step size of $\alpha_t = \lambda / (1 + t/100)$, i.e.,

$$w_{t+1} = w_t + \alpha_t d(w_t),$$

where $d(w_t)$ is the search direction at iteration $t$. We ran the experiment 20 times over a
range $\lambda \in \{1/4, 1/2, 1, 2, 4, \ldots, 512, 1024, 2048\}$ to choose the best step size for each method.
The experiments were then run 50 times for the best step size to obtain an unbiased estimate
of performance for that step size, which we report. After each policy update we estimated
the cumulative reward of the policy (this requires no additional data, since the data used to
estimate the return is exactly the data used to estimate the Q-function) and if the return
was found to have decreased we returned to the previous parameter point. This simple
heuristic (a 2-point line search) prevents variance in the gradient estimates from causing
policy degradation and instability.
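The accept-or-revert heuristic can be written down compactly. This is a sketch under the assumption that a return estimate is available as a callable; `estimate_return` and `guarded_update` are hypothetical names, and in the experiment the estimate comes from the same trajectories used to fit the Q-function.

```python
def guarded_update(w, direction, step, estimate_return, prev_return):
    """Take the step w + step * direction, reverting to w if the return decreased.

    Returns the parameters that are kept together with their estimated return.
    """
    candidate = [wi + step * di for wi, di in zip(w, direction)]
    new_return = estimate_return(candidate)
    if new_return < prev_return:
        return w, prev_return  # reject: go back to the previous parameter point
    return candidate, new_return
```

For example, with a toy return `-(x ** 2)` a modest step towards zero is accepted, while a step that overshoots far past the optimum is rejected and the previous parameters are kept.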

Figure 4a shows the cumulative reward after each iteration for the 5 methods along
with the standard error. A cumulative reward of 50 corresponds to a near-optimal policy in which the
pole is quickly swung up and balanced for the entire episode. A cumulative reward of 40 to
45 indicates that the pole is swung up and balanced, but either not optimally quickly, or
that the controller is unable to balance the pole for the entire remainder of the episode.
The Gauss-Newton methods significantly outperform all competitor methods both in terms
of the speed at which good policies are learned and the average value of the policy at
convergence. Furthermore, as predicted by theory, a step-size of 1 for the Gauss-Newton
methods was found to perform well; i.e., good performance could be obtained without
step-size tuning.


6.3 Non-Linear Navigation Experiment 


The next domain that we consider is the synthetic two-dimensional non-linear MDP considered
in Vlassis et al. (2009). The state-space of the problem is two-dimensional, $s = (s^1, s^2)$,
in which $s^1$ is the agent’s position and $s^2$ is the agent’s velocity. The control is
one-dimensional and the dynamics of the system are given as follows,

$$s^1_{t+1} = s^1_t + \frac{1}{1 + e^{-a_t}} - 0.5 + \kappa,$$

$$s^2_{t+1} = s^2_t - 0.1 s^1_{t+1} + \kappa,$$

with $\kappa$ a zero-mean Gaussian random variable with standard deviation $\sigma_\kappa = 0.02$. The
agent starts in the state $s = (0, 1)$, with the addition of Gaussian noise with standard
deviation 0.001, and the objective is for the agent to reach the target state, $s_{\text{target}} = (0, 0)$.
We use the same policy as in Vlassis et al. (2009), which is given by $a_t = (w + \epsilon_t)^\top s_t$, with
control parameters, $w$, and $\epsilon_t \sim \mathcal{N}(\epsilon_t; 0, \sigma_\epsilon^2 I)$. The objective function is non-trivial for
$w \in [0, 60] \times [-8, 0]$. In the experiment the initial control parameters were sampled from
the region $w_0 \in [0, 60] \times [-8, 0]$. In all algorithms 50 trajectories were sampled during each
training iteration and used to estimate the search direction. We consider a finite planning
horizon, $H = 80$. The experiment was repeated 100 times and the results of the experiment
are given in Figure 4b, which gives the mean and standard error of the results. The step
size sequences of steepest gradient ascent, natural gradient ascent and the Gauss-Newton
method were all tuned for performance and the results shown were obtained from the best
step size sequence for each algorithm.
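A single trajectory of this navigation task could be simulated roughly as below. This is a sketch under stated assumptions: the sigmoid term is taken to act on the action, the velocity update uses the already-updated position, and the policy noise level `sigma_eps` is a placeholder because its value is not given in the text. The function names are illustrative.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def simulate(w, horizon=80, sigma_eps=0.1, kappa_std=0.02, init_std=0.001, rng=None):
    """Roll out the policy a_t = (w + eps_t)^T s_t and return the visited states."""
    rng = rng if rng is not None else random.Random(0)
    s1 = 0.0 + rng.gauss(0.0, init_std)  # position
    s2 = 1.0 + rng.gauss(0.0, init_std)  # velocity
    states = [(s1, s2)]
    for _ in range(horizon):
        a = ((w[0] + rng.gauss(0.0, sigma_eps)) * s1
             + (w[1] + rng.gauss(0.0, sigma_eps)) * s2)
        s1 = s1 + sigmoid(a) - 0.5 + rng.gauss(0.0, kappa_std)
        s2 = s2 - 0.1 * s1 + rng.gauss(0.0, kappa_std)
        states.append((s1, s2))
    return states
```

Setting all noise levels to zero gives a deterministic rollout, which is convenient for checking the dynamics before estimating search directions from 50 noisy trajectories per iteration.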


6.4 N-link Rigid Manipulator Experiments

The N-link rigid robot arm manipulator is a standard continuous model, consisting of an
end effector connected to an N-linked rigid body (Khalil, 2001). A graphical depiction of a
3-link rigid manipulator is given in Figure 5. A typical continuous control problem for such
systems is to apply appropriate torque forces to the joints of the manipulator so as to move
the end effector into a desired position. The state of the system is given by $q, \dot{q}, \ddot{q} \in \mathbb{R}^N$,
where $q$, $\dot{q}$ and $\ddot{q}$ denote the angles, velocities and accelerations of the joints respectively,
while the control variables are the torques applied to the joints, $\tau \in \mathbb{R}^N$. The nonlinear
state equations of the system are given by (Spong et al., 2005),

$$M(q)\ddot{q} + C(q, \dot{q})\dot{q} + g(q) = \tau, \qquad (36)$$


where $M(q)$ is the inertia matrix, $C(q, \dot{q})$ denotes the Coriolis and centripetal forces and
$g(q)$ is the gravitational force. While this system is highly nonlinear it is possible to define
an appropriate control function $\tau(q, \dot{q})$ that results in linear dynamics in a different
state-action space. This technique is known as feedback linearisation (Khalil, 2001), and in the
case of an N-link rigid manipulator recasts the torque action space into the acceleration
action space. This means that the state of the system is now given by $q$ and $\dot{q}$, while
the control is $a = \ddot{q}$. Ordinarily in such problems the reward would be a function of the
generalized co-ordinates of the end effector, which results in a non-trivial reward function
in terms of $q$, $\dot{q}$ and $\ddot{q}$. This can be accounted for by modelling the reward function as
a mixture of Gaussians (Hoffman et al., 2009), but for simplicity we consider the simpler
problem where the reward is a function of $q$, $\dot{q}$ and $\ddot{q}$ directly. In all of the experiments in
this section we consider a 3-link rigid manipulator.

Figure 4: (a) Results from the cart-pole experiment. (b) Results from the non-linear navigation
task, with the results for steepest gradient ascent (black), Expectation Maximization
(blue), natural gradient ascent (green) and the Gauss-Newton method (red).

Under certain forms of policy parametrization it is possible to perform exact evaluation
of the search direction in these systems. As such, these systems allow for the direct
comparison of the search direction of various policy search algorithms, and yet are sufficiently
difficult optimization problems to provide a challenging platform for these methods. In all
experiments we consider a policy of the form

$$\pi(a|s; w) = \mathcal{N}(a \,|\, Ks + m, \sigma^2 I),$$

with $w = (K, m, \sigma)$ and $s \in \mathbb{R}^{n_s}$, $a \in \mathbb{R}^{n_a}$, for some $n_s, n_a \in \mathbb{N}$. We consider the finite
horizon undiscounted problem in this section, so that the gradient of the objective function
takes the form

$$\nabla_w U(w) = \int ds\, da\, \nabla_w \log \pi(a|s; w) \sum_{t=1}^{H} p_t(s, a; w) Q(s, a, t; w),$$


with the preconditioning matrices of natural gradient ascent and the Gauss-Newton methods
taking analogous forms. For any $(s, a) \in \mathcal{S} \times \mathcal{A}$, it can be shown that the derivative
of $\log \pi(a|s; w)$ with respect to $w$ is a quadratic in $(s, a)$. This means that to calculate the search directions of
steepest gradient ascent, natural gradient ascent, Expectation Maximization and the
Gauss-Newton methods it is necessary to calculate the first two moments of $p_t(s, a; w) Q(s, a, t; w)$
w.r.t. $(s, a)$, for each $t \in \mathbb{N}_H$. These calculations can be done using the methods presented
in Furmston (2012). In these experiments the maximal value of the objective function
varied dramatically depending on the random initialization of the system. To account for
the variation in the maximal value of the objective function the results of each experiment
are normalized by the maximal value achieved between the algorithms for that experiment,
so that the result displayed is the percentage of reward received in comparison to the best
results among the algorithms considered in the experiment.

Figure 5: A graphical depiction of a 3-link rigid manipulator, with the angles of the joints
given by $q_1$, $q_2$ and $q_3$ respectively.


6.4.1 Experiment Using Line Search 

In the first experiment we compare the search direction of steepest gradient ascent, natural 
gradient ascent, Expectation Maximization and the second Gauss-Newton method. For all 
algorithms that required the specification of a step size we use the minFunc³ optimization
library to perform a line search. We also use the minFunc library to provide a stopping 
criterion for all algorithms. We found that both the line search algorithm and the step size 
initialization had a significant effect on the performance of all algorithms. We therefore tried 
various combinations of these settings for each algorithm and selected the one that gave 
the best performance. We tried bracketing line search algorithms with: step size halving; 
quadratic/cubic interpolation from new function values; cubic interpolation from new func¬ 
tion and gradient values; step size doubling and bisection; cubic interpolation/extrapolation 
with function and gradient values. We tried the following step size initializations: quadratic 
initialization using previous function value, and new function value and gradient; twice the 
previous step size. To handle situations where the initial policy parametrization was in a 
‘flat’ area of the parameter space far from any optima we set the function and point tolerance
of minFunc to zero for all algorithms. We repeated each experiment 100 times and the
results are shown in Figure 6a. The second Gauss-Newton method significantly outperforms
all of the comparison algorithms. The step direction of Expectation Maximization is very
similar to the search direction of the second Gauss-Newton method in this problem. In fact,
given that the log-policy is quadratic in the mean parameters, they are the same for the
mean parameters. The difference in performance between the Gauss-Newton method and
Expectation Maximization is largely explained by the tuning of the step size in the
Gauss-Newton method, compared to the constant step size of one in Expectation Maximization.
To observe the effect of poor scaling on the performance of the various algorithms we
observe the number of iterations that each algorithm requires. These counts are given in Table 1.
Steepest gradient ascent required far more iterations than either natural gradient ascent
or the Gauss-Newton method, both of which require roughly the same number of iterations.
This supports the claim that both natural gradient ascent and the Gauss-Newton method are more
robust to poor scaling than steepest gradient ascent.

3. This software library is freely available at http://www.di.ens.fr/~mschmidt/Software/minFunc.html

Figure 6: Normalized total expected reward plotted against training time (in seconds) for
the 3-link rigid manipulator. (a) The results from the line search experiment, with the plot
showing the results for steepest gradient ascent (black), Expectation Maximization (blue),
the second Gauss-Newton method (red) and natural gradient ascent (green). (b) The results
from the fixed step size experiment, with the plot showing the results for steepest gradient
ascent (black), Expectation Maximization (blue), the second Gauss-Newton method (red) and
natural gradient ascent (green).

6.4.2 Experiment Using Fixed Step Size 

Line search as performed in the previous experiment is expensive to perform in practice, 
particularly in stochastic environments where many function evaluations may be required to 
obtain accurate function estimates. To obtain a gauge on the difficulty of selecting a step size 
sequence for the various policy search methods we again consider the 3-link manipulator, 
but now consider a fixed step size throughout training. This is a difficult problem for 
algorithms such as steepest gradient ascent because the parameter space has a non-trivial 
number of dimensions and the objective is poorly-scaled. In both steepest gradient ascent 
and natural gradient ascent we considered the following fixed step sizes: 0.001, 0.01, 0.1, 
1, 10, 20, 30, 100 and 250. We were unable to obtain any reasonable results with steepest 
gradient ascent with any of these fixed step sizes, for which reason the results are omitted. 
In natural gradient ascent we found 30 to be the best step size of those considered. In the
Gauss-Newton method we considered the following fixed step sizes: 10, 20, 30, 100 and 250,
and found that the fixed step size of 30 gave consistently good results without overstepping
in the parameter space. The smaller step sizes obtained better results than Expectation
Maximization, but worse than the fixed step size of 30. The larger step sizes often found
superior results, but would sometimes overstep in the parameter space. For these reasons
we used the fixed step size of 30 in the final experiment. We repeated the experiment 100
times and the results of the experiment are plotted in Figure 6b. The results show that
even though this step size tuning is crude it is still possible to obtain strong results in
comparison to Expectation Maximization, which doesn’t require the selection of a step size
sequence. In the experiment the Gauss-Newton method only took around 50 seconds to
obtain the same performance as 300 seconds of training with Expectation Maximization.
Furthermore, Expectation Maximization was only able to obtain 40% of the performance of
the Gauss-Newton method, while natural gradient ascent was only able to obtain around
15% of the performance. The reason that natural gradient ascent performed so poorly in
this problem was that the initial control parameters were typically in a plateau region
of the parameter space where the objective was close to zero. To get out of this plateau
region on a regular basis and in the given amount of training time would require an overly
large step size. However, once in a high reward part of the parameter space we found that,
using natural gradient ascent, these large step sizes would result in overshooting in the
parameter space and poor performance. The step size of 30 was able to locate areas of high
reward in a subset of the problems considered in the experiment, while not suffering from
overshooting as much as the larger step sizes. The experiment highlights the robustness of
the Gauss-Newton method to poor scaling, as well as the relative ease (in comparison to
algorithms such as natural gradient ascent) of selecting a good step size sequence.

Figure 7: A graphical illustration of the game of Tetris with (a) the collection of possible
pieces, or tetrzoids, of which there are seven, and (b) a possible configuration of the board,
which in this example is of height 20 and width 10.

Figure 8: (a) Results from the Tetris experiment, with results for steepest gradient ascent
(black), natural gradient ascent (green), the diagonal Gauss-Newton method (blue) and the 
Gauss-Newton method (red), (b) Results from the robot arm experiment, with results for 
the second Gauss-Newton method (red) and the EM-algorithm (blue). 


6.5 Tetris Experiment 

In this experiment we consider the Tetris domain, which is a popular computer game
designed by Alexey Pajitnov in 1985. In Tetris there exists a board, which is typically a 20 × 10
grid, which is empty at the beginning of a game. During each stage of the game a four block
piece, called a tetrzoid, appears at the top of the board and begins to fall down the board.
Whilst the tetrzoid is moving the player is allowed to rotate the tetrzoid and to move it
left or right. The tetrzoid stops moving once it reaches either the bottom of the board or
a previously positioned tetrzoid. In this manner the board begins to fill up with tetrzoid
pieces. There are seven different variations of tetrzoid, as shown in Figure 7a. When a
horizontal line of the board is completely filled with (pieces of) tetrzoids the line is removed
from the board and the player receives a score of one. The game terminates when the player
is not able to fully place a tetrzoid on the board due to insufficient space remaining on the
board. An example configuration of the board during a game of Tetris is given in Figure 7b.
More details on the game of Tetris can be found in Fahey (2003). As in other applications
of Tetris in the reinforcement learning literature (Kakade, 2002; Bertsekas and Ioffe, 1996)
we consider a simplified version of the game in which the current tetrzoid remains above the
board until the player decides upon a desired rotation and column position for the tetrzoid.


             | Steepest Gradient Ascent | Natural Gradient Ascent | Gauss-Newton Method
Iterations   | 3684 ± 314               | 203 ± 34                | 310 ± 40

Table 1: Iteration counts of the 3-link manipulator experiment for steepest gradient ascent,
natural gradient ascent and the Gauss-Newton method when using the minFunc optimization
library.

Firstly, we compare the performance of the full and diagonal second Gauss-Newton
methods to other policy search methods. Due to computational costs we consider a 10 ×
10 board in this experiment, which results in a state space with roughly $7 \times 2^{100}$ states
(Bertsekas and Ioffe, 1996). We model the policy using a Gibbs distribution, and consider
a feature vector with the following features: the heights of each column, the difference in
heights between adjacent columns, the maximum height and the number of ‘holes’. This is
the same set of features as used in Bertsekas and Ioffe (1996) and Kakade (2002). Under this
policy it is not possible to obtain the explicit maximum over $w$ in (16), so a straightforward
application of the EM-algorithm is not possible in this problem. We therefore compare
the diagonal and full Gauss-Newton methods with steepest and natural gradient ascent.
We use the same procedure to evaluate the search direction for all the algorithms in the
experiment. Irrespective of the policy, a game of Tetris is guaranteed to terminate after
a finite number of turns (Bertsekas and Ioffe, 1996). We therefore model each game as
an absorbing state MDP. The reward at each time-point is equal to the number of lines
deleted. We use a recurrent state approach (Williams, 1992) to estimate the gradient, using
the empty board as a recurrent state. (Since a new game starts with an empty board this
state is recurrent.)⁴ We use analogous versions of this recurrent state approach for natural
gradient ascent, the diagonal Gauss-Newton method and the full Gauss-Newton method. As
in Kakade (2002), we use the sample trajectories obtained during the gradient evaluation to
estimate the Fisher information matrix. During each training iteration an approximation of
the search direction is obtained by sampling 1000 games, using the current policy to sample
the games. Given the current approximate search direction we use the following basic line
search method to obtain a step size: for every step size in a given finite set of step sizes,
sample a set number of games and then return the step size with the maximal score over
these games. In practice, in order to reduce the susceptibility to random noise, we used the
same simulator seed for each possible step size in the set. In this line search procedure we
sampled 1000 games for each of the possible step sizes. We use the same set of step sizes,

$$\{0.1, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0\},$$

in all of the different training algorithms in the experiment. To reduce the amount of noise
in the results we use the same set of simulator seeds in the search direction evaluation
for each of the algorithms considered in the experiment. In particular, we generate an
$n_{\text{experiments}} \times n_{\text{iterations}}$ matrix of simulator seeds, with $n_{\text{experiments}}$ the number of repetitions
of the experiment and $n_{\text{iterations}}$ the number of training iterations in each experiment. We
use this one matrix of simulator seeds in all of the different training algorithms, with the
element in the $j$th column and $i$th row corresponding to the simulator seed of the $j$th training
iteration of the $i$th experiment. In a similar manner, the set of simulator seeds we use for
the line search procedure is the same for all of the different training algorithms. Finally, to
make the line search consistent among all of the different training algorithms we normalize
the search direction and use the resulting unit vector in the line search procedure. We
ran 100 repetitions of the experiment, each consisting of 100 training iterations, and the
mean and standard error of the results are given in Figure 8a. It can be seen that the full
Gauss-Newton method outperforms all of the other methods, while the performance of the
diagonal Gauss-Newton method is comparable to natural gradient ascent.

4. This is actually an approximation because it doesn’t take into account that the state is given by the
configuration of the board and the current piece, so this particular ‘recurrent state’ ignores the current
piece. Empirically we found that this approximation gave better results, presumably due to reduced
variance in the estimands, and there is no reason to believe that it is unfairly biasing the comparison
between the various parametric policy search methods.
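The seeded line search described in this section can be sketched as follows. The callable `play_game(weights, rng)` is a hypothetical stand-in for a Tetris simulator that returns the score of one game, and `line_search` is an illustrative name; the point of the sketch is only that every candidate step size is evaluated on the same seeds and along the normalized search direction.

```python
import random


def line_search(w, direction, step_sizes, play_game, seeds):
    """Return the step size whose candidate policy scores best over the seeded games."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("search direction must be non-zero")
    unit = [d / norm for d in direction]  # normalize so step sizes are comparable
    best_step, best_score = None, float("-inf")
    for step in step_sizes:
        candidate = [wi + step * ui for wi, ui in zip(w, unit)]
        # Re-create each generator from its seed so all step sizes see the same games.
        score = sum(play_game(candidate, random.Random(seed)) for seed in seeds)
        if score > best_score:
            best_step, best_score = step, score
    return best_step
```

Because the random draws are shared across candidates, differences in the summed score reflect the step size rather than simulator noise, which is the variance-reduction effect the shared seeds are meant to achieve.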

We also ran several training runs of the full approximate Newton method on the full-sized
20 × 10 board and were able to obtain a score in the region of 14,000 completed lines, which
was obtained after roughly 40 training iterations. An approximate dynamic programming
based method has previously been applied to the Tetris domain in Bertsekas and Ioffe
(1996). The same set of features were used and a score of roughly 4,500 completed lines
was obtained after around 6 training iterations, after which the solution then deteriorated.
More recently a modified policy iteration approach (Gabillon et al., 2013) was able to
obtain significantly better performance in the game of Tetris, completing approximately 51
million lines in a 20 × 10 board. However, these results were obtained through an entirely
different set of features, and analysis of the results in Gabillon et al. (2013) indicates that
this difference in features makes a substantial difference in performance. On a 10 × 10 board
using the same features as Bertsekas and Ioffe (1996) the approach of Gabillon et al. (2013)
was able to complete approximately 500 lines on average.


6.6 Robot Arm Experiment 

In the final experiment we consider a robotic arm application. We use the Simulation
Lab (Schaal, 2006) environment, which provides a physically realistic engine of a Barrett
WAM™ robot arm. We consider the ball-in-a-cup domain (Kober and Peters, 2009), which
is a challenging motor skill problem that is based on the traditional children’s game. In this
domain a small cup is attached to the end effector of the robot arm. A ball is attached to
the cup through a piece of string. At the beginning of the task the robot arm is stationary
and the ball is hanging below the cup in a stationary position. The aim of the task is for
the robot arm to learn an appropriate set of joint movements to first swing the ball above
the cup and then to catch the ball in the cup when the ball is in its downward trajectory.
The domain is episodic, with each episode 20 seconds in length. The state of the domain
is given by the angles and velocities of the seven joints in the robot arm, along with the
Cartesian coordinates of the ball. The action is given by the joint accelerations of the robot
arm. We denote the position of the cup and the ball by $(x_c, y_c, z_c) \in \mathbb{R}^3$ and $(x_b, y_b, z_b) \in \mathbb{R}^3$
respectively. The reward function is given by

$$r(x_c, y_c, x_b, y_b, t) = \begin{cases} -20\left((x_c - x_b)^2 + (y_c - y_b)^2\right) & \text{if } t = t_c, \\ 0 & \text{if } t \neq t_c, \end{cases}$$

in which $t_c$ is the moment the ball crosses the z-plane (level with the cup) in a downward
direction. If no such $t_c$ exists then the reward of the episode is given by $-100$.
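The episode reward reduces to a single evaluation at the crossing time. A small sketch, assuming the crossing time has already been detected by the simulator and is passed in as `t_c` (or `None` when the ball never crosses the plane of the cup); the function name and argument layout are hypothetical.

```python
def episode_reward(cup_xy, ball_xy, t_c):
    """Total reward of one ball-in-a-cup episode.

    cup_xy and ball_xy are the (x, y) positions of cup and ball at time t_c.
    If the ball never crosses the plane of the cup downwards, t_c is None
    and the episode receives the fixed penalty of -100.
    """
    if t_c is None:
        return -100.0
    dx = cup_xy[0] - ball_xy[0]
    dy = cup_xy[1] - ball_xy[1]
    return -20.0 * (dx * dx + dy * dy)
```

A perfect catch therefore scores 0, and the reward decreases quadratically with the horizontal distance between ball and cup at the crossing time.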


We use the motor primitive framework (Ijspeert et al., 2002, 2003; Schaal et al., 2007;
Kober and Peters, 2011) in this domain, applying a separate motor primitive to each dimension
of the action space. Each motor primitive consists of a parametrized curve that models
the desired action sequence (for the respective dimension of the action space) through the
course of the episode. Given this collection of motor primitives the control engine within
the simulator tries to follow the desired action sequence as closely as possible whilst also
satisfying the constraints on the system, such as the physical constraints on the torques that
can safely be applied without damaging the robot arm. As in Kober and Peters (2011) we
use dynamic motor primitives, using 10 shape parameters for each of the individual motor
primitives. The robot arm has 7 joints, so that there are 70 motor primitive parameters in
total. We optimize the parameters of the motor primitives by considering the MDP induced
by this motor primitive framework. The action space corresponds to the space of possible
motor primitives, so that $\mathcal{A} = \mathbb{R}^{70}$. There is no state space in this MDP and the planning
horizon is 1, so that this MDP is effectively a bandit problem. The reward of an action
is equal to the total reward of the episode induced by the motor primitive. We consider a
policy of the form,

$$\pi(a; w) = \mathcal{N}(a \,|\, \mu, (LL^\top)^{-1}),$$

with $w = (\mu, L)$, $\mu$ the mean of the Gaussian and $LL^\top$ the Cholesky decomposition of the
precision matrix. We consider a diagonal precision matrix, which results in a total of 140
policy parameters.
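For a diagonal precision matrix the policy and its score function take a particularly simple form. The sketch below assumes $L$ is stored as a vector of non-zero diagonal entries `l_diag`, so that each action dimension has standard deviation `1 / abs(l_i)`; the function names are illustrative.

```python
import math
import random


def sample_action(mu, l_diag, rng):
    """Sample a ~ N(mu, (L L^T)^{-1}) with L = diag(l_diag)."""
    return [m + rng.gauss(0.0, 1.0) / abs(l) for m, l in zip(mu, l_diag)]


def log_density(a, mu, l_diag):
    """log N(a | mu, (L L^T)^{-1}) for diagonal L."""
    return sum(math.log(abs(l)) - 0.5 * math.log(2.0 * math.pi)
               - 0.5 * (l * (ai - m)) ** 2
               for ai, m, l in zip(a, mu, l_diag))


def grad_mu(a, mu, l_diag):
    """Gradient of the log-density w.r.t. the mean: l_i^2 (a_i - mu_i)."""
    return [l * l * (ai - m) for ai, m, l in zip(a, mu, l_diag)]


def grad_l(a, mu, l_diag):
    """Gradient of the log-density w.r.t. the diagonal of L: 1/l_i - l_i (a_i - mu_i)^2."""
    return [1.0 / l - l * (ai - m) ** 2 for ai, m, l in zip(a, mu, l_diag)]
```

With 70 motor primitive parameters this gives 70 entries in `mu` and 70 in `l_diag`, matching the 140 policy parameters stated above. The log-density is quadratic in `mu` but not jointly concave in `mu` and `l_diag`.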

In this experiment we compare steepest gradient ascent, natural gradient ascent. Ex¬ 
pectation Maximization, the first Gauss-Newton method and the second Gauss-Newton 
method. As the planning horizon is of length 1 it follows that T-Li 2 {w) = 0, \/w G W, so 
that the first Gauss-Newton method coincides with the Newton method for this MDP. The 
policy is block-wise log-concave in /x and L, but not jointly log-concave in /x and L. As a 
result we construct block diagonal forms of the preconditioning matrices for the first and 
second Gauss-Newton methods, with a separate block for /x and L. Additionally, since the 
planning horizon is of length 1 it is possible to calculate the Fisher information exactly 
in this domain. For steepest gradient ascent and natural gradient ascent we considered 
several different step size sequences. Each sequence considered had a constant step size 
throughout, and the sequences differed in the size of this step size. We considered step sizes 
of length 1, 0.1, 0.01 and 0.001. For both Gauss-Newton methods we considered a fixed 
step size of one throughout training (i.e., no tuning of the step size sequence was performed 
for either the first or the second Gauss-Newton methods). As in Kober and Peters (2009) 
the initial value of /x is set so that the trajectory of the robot arm mimics that of a given 
human demonstration. The diagonal elements of the precision matrix are initialized to 
0.01. During each training iteration we sampled 15 actions from the policy and used the 
episodes generated from these samples to estimate the search direction. To deal with this 
low number of samples we used the samples from the last 10 training iterations when calcu¬ 
lating the search direction, taking the ‘effective’ sample size up to 150. Finally, we used the 
reward/fitness shaping approach of |Wierstra et ah (2014) in all the algorithms considered, 
using the same shaping function as in Wierstra et al. (2014). In each run of the experiment 
we performed 100 updates of the policy parameters. We repeated the experiment 50 times 
and the results are given in Figure 8b. We were unable to successfully learn to catch the
ball in the cup using steepest gradient ascent, natural gradient ascent or the first
Gauss-Newton method. For this reason the results for these algorithms are omitted. It can
be seen that the second Gauss-Newton method significantly outperforms the EM-algorithm 
in this domain. Out of the 50 runs of the experiment, the second Gauss-Newton method 
was successfully able to learn to catch the ball in the cup 45 times. The EM-algorithm 
successfully learnt the task 36 times. As the log-policy is quadratic in $\mu$ and a fixed step
size of one was used in the second Gauss-Newton method it follows that the updates of $\mu$ in
the second Gauss-Newton method and the EM-algorithm are the same. The difference in
performance can therefore be attributed to the difference in the updates of $L$ between the
two algorithms.
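
The sample-reuse scheme described above (15 samples per iteration, pooled over the last 10 iterations) can be sketched as a sliding window. This is only an illustrative sketch: the names `history` and `effective_samples` are hypothetical, and the tuples stand in for sampled episodes, which are not specified here.

```python
from collections import deque

def effective_samples(history):
    """Flatten the retained per-iteration batches into one list."""
    return [sample for batch in history for sample in batch]

# Keep only the batches from the 10 most recent training iterations.
history = deque(maxlen=10)

for iteration in range(12):
    # Placeholder batch: 15 (iteration, index) pairs standing in for sampled episodes.
    batch = [(iteration, i) for i in range(15)]
    history.append(batch)
    pooled = effective_samples(history)

print(len(pooled))
```

Once at least 10 iterations have been run, the pooled list holds 10 batches of 15 samples; older batches are discarded automatically by the bounded deque.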

7. Conclusions 

Approximate Newton methods, such as quasi-Newton methods and the Gauss-Newton 
method, are standard optimization techniques. These methods aim to maintain the benefits 
of Newton’s method, whilst alleviating its shortcomings. In this paper we have considered 
approximate Newton methods in the context of policy optimization in MDPs. The first contribution
of this paper was to provide a novel analysis of the Hessian of the MDP objective
function for policy optimization. This included providing a novel form for the Hessian, as 
well as detailing the positive/negative definiteness properties of certain terms in the Hessian. 
Furthermore, we have shown that when the policy parametrization is sufficiently rich then 
the remaining terms in the Hessian vanish in the vicinity of a local optimum. Motivated by 
this analysis we introduced two Gauss-Newton Methods for MDPs. Like the Gauss-Newton 
method for non-linear least squares, these methods involve approximating the Hessian by 
ignoring certain terms in the Hessian which are difficult to estimate. The approximate 
Hessians possess desirable properties, such as negative definiteness, and we demonstrate 
several important performance guarantees including guaranteed ascent directions, invariance
to affine transformation of the parameter space, and convergence guarantees. We
also demonstrated that our second Gauss-Newton algorithm is closely related to both the
EM-algorithm and natural gradient ascent applied to MDPs, providing novel insights into both
of these algorithms. We have compared the proposed Gauss-Newton methods with other 
techniques in the policy search literature over a range of challenging domains, including 
Tetris and a robotic arm application. We found that the second Gauss-Newton method 
performed significantly better than other methods in all of the domains that we considered.

Appendix A. Proofs

A.1 Proofs of Theorems 1 and 3

We begin with an auxiliary Lemma. 

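Lemma 1. Suppose we are given a Markov decision process with objective $U(w)$ and a
Markovian trajectory distribution. For any given parameter vector, $w \in \mathcal{W}$, the following
identities hold,

$$\nabla_w V(s; w) = \sum_{t=1}^{\infty} \sum_{s_t \in \mathcal{S}} \sum_{a_t \in \mathcal{A}} \gamma^{t-1} p(s_t, a_t | s_1 = s; w) \, Q(s_t, a_t; w) \, \nabla_w \log \pi(a_t | s_t; w), \tag{37}$$

$$\nabla_w Q(s, a; w) = \sum_{t=2}^{\infty} \sum_{s_t \in \mathcal{S}} \sum_{a_t \in \mathcal{A}} \gamma^{t-1} p(s_t, a_t | s_1 = s, a_1 = a; w) \, Q(s_t, a_t; w) \, \nabla_w \log \pi(a_t | s_t; w). \tag{38}$$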


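Proof. We start by writing the value function in the form

$$V(s; w) = \sum_{t=1}^{\infty} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t} | s_1 = s; w) \, R(s_t, a_t), \tag{39}$$

so that,

$$\nabla_w V(s; w) = \sum_{t=1}^{\infty} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t} | s_1 = s; w) \, \nabla_w \log p(s_{1:t}, a_{1:t} | s_1 = s; w) \, R(s_t, a_t).$$

Using the fact that

$$\nabla_w \log p(s_{1:t}, a_{1:t} | s_1 = s; w) = \sum_{\tau=1}^{t} \nabla_w \log \pi(a_\tau | s_\tau; w), \tag{40}$$

we have that,

$$\begin{aligned}
\nabla_w V(s; w) &= \sum_{t=1}^{\infty} \sum_{\tau=1}^{t} \sum_{s_\tau, a_\tau} \sum_{s_t, a_t} \gamma^{t-1} p(s_\tau, a_\tau, s_t, a_t | s_1 = s; w) \, \nabla_w \log \pi(a_\tau | s_\tau; w) \, R(s_t, a_t) \\
&= \sum_{\tau=1}^{\infty} \sum_{s_\tau, a_\tau} \gamma^{\tau-1} p(s_\tau, a_\tau | s_1 = s; w) \, \nabla_w \log \pi(a_\tau | s_\tau; w) \sum_{t=\tau}^{\infty} \sum_{s_t, a_t} \gamma^{t-\tau} p(s_t, a_t | s_\tau, a_\tau; w) \, R(s_t, a_t) \\
&= \sum_{\tau=1}^{\infty} \sum_{s_\tau, a_\tau} \gamma^{\tau-1} p(s_\tau, a_\tau | s_1 = s; w) \, \nabla_w \log \pi(a_\tau | s_\tau; w) \, Q(s_\tau, a_\tau; w),
\end{aligned} \tag{41}$$

where in the second line we swapped the order of summation and the third line follows from
the definition of $Q$. Identity (38) now follows by applying (37):

$$\begin{aligned}
\nabla_w Q(s, a; w) &= \gamma \sum_{s'} P(s' | s, a) \, \nabla_w V(s'; w) \\
&= \gamma \sum_{s'} P(s' | s, a) \sum_{t=2}^{\infty} \sum_{s_t \in \mathcal{S}} \sum_{a_t \in \mathcal{A}} \gamma^{t-2} p(s_t, a_t | s_2 = s'; w) \, \nabla_w \log \pi(a_t | s_t; w) \, Q(s_t, a_t; w) \\
&= \sum_{t=2}^{\infty} \sum_{s_t \in \mathcal{S}} \sum_{a_t \in \mathcal{A}} \gamma^{t-1} p(s_t, a_t | s_1 = s, a_1 = a; w) \, Q(s_t, a_t; w) \, \nabla_w \log \pi(a_t | s_t; w).
\end{aligned}$$

□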
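
Identity (37) can be checked numerically on a small example. The sketch below is not taken from the paper: the 2-state, 2-action MDP (`P`, `R`), the tabular softmax policy and all function names are assumptions chosen for illustration. It compares the right-hand side of (37) with a finite-difference gradient of the value function.

```python
import math

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP: P[s][a][s'] and R[s][a].
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.5, 2.0]]

def policy(w):
    """Tabular softmax policy: pi[s][a] from logits w[s][a]."""
    pi = []
    for logits in w:
        z = [math.exp(x - max(logits)) for x in logits]
        pi.append([x / sum(z) for x in z])
    return pi

def evaluate(w, sweeps=2000):
    """Iterative policy evaluation; returns V[s] and Q[s][a]."""
    pi = policy(w)
    V = [0.0, 0.0]
    for _ in range(sweeps):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][n] * V[n] for n in range(2))
              for a in range(2)] for s in range(2)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(2)) for s in range(2)]
    return V, Q

def occupancy(w, start, steps=2000):
    """Discounted state occupancy: sum_t gamma^(t-1) p(s_t = x | s_1 = start)."""
    pi = policy(w)
    dist = [1.0 if s == start else 0.0 for s in range(2)]
    occ = [0.0, 0.0]
    discount = 1.0
    for _ in range(steps):
        occ = [occ[s] + discount * dist[s] for s in range(2)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][n]
                    for s in range(2) for a in range(2)) for n in range(2)]
        discount *= GAMMA
    return occ

def gradient_formula(w, start):
    """Right-hand side of identity (37) for the tabular softmax policy."""
    pi = policy(w)
    _, Q = evaluate(w)
    occ = occupancy(w, start)
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for x in range(2):
        for a in range(2):
            weight = occ[x] * pi[x][a] * Q[x][a]
            for b in range(2):
                # d log pi(a|x) / d w[x][b] = 1[a == b] - pi(b|x)
                grad[x][b] += weight * ((1.0 if a == b else 0.0) - pi[x][b])
    return grad

def gradient_fd(w, start, eps=1e-6):
    """Central finite-difference gradient of V(start; w)."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for x in range(2):
        for b in range(2):
            up = [row[:] for row in w]
            dn = [row[:] for row in w]
            up[x][b] += eps
            dn[x][b] -= eps
            grad[x][b] = (evaluate(up)[0][start] - evaluate(dn)[0][start]) / (2 * eps)
    return grad

w = [[0.3, -0.2], [0.1, 0.4]]
analytic = gradient_formula(w, start=0)
numeric = gradient_fd(w, start=0)
max_gap = max(abs(analytic[x][b] - numeric[x][b]) for x in range(2) for b in range(2))
print(max_gap)
```

The two gradients should agree up to finite-difference and truncation error; the infinite sums are truncated at 2000 steps, which is an arbitrary choice that is more than sufficient for the discount factor assumed here.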


Theorem 1. Proof. Theorem 1 follows immediately from Lemma 1 by taking the expectation
over $s_1$ w.r.t. the start state distribution $p_1$ and using the definition of the
discounted trajectory distribution. □

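Theorem 3. Proof. Starting from

$$U(w) = \sum_{t=1}^{\infty} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, R(s_t, a_t),$$

the Hessian of the objective takes the form

$$\begin{aligned}
\nabla_w \nabla_w^\top U(w) &= \sum_{t=1}^{\infty} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \nabla_w^\top \log p(s_{1:t}, a_{1:t}; w) \, R(s_t, a_t) \\
&\quad + \sum_{t=1}^{\infty} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \log p(s_{1:t}, a_{1:t}; w) \, \nabla_w^\top \log p(s_{1:t}, a_{1:t}; w) \, R(s_t, a_t).
\end{aligned} \tag{42}$$

Using the fact that $\nabla_w \nabla_w^\top \log p(s_{1:t}, a_{1:t}; w) = \sum_{\tau=1}^{t} \nabla_w \nabla_w^\top \log \pi(a_\tau | s_\tau; w)$ we will
show that the first term in (42) is equal to $\mathcal{H}_2(w)$ as defined in (19):

$$\begin{aligned}
&\sum_{t=1}^{\infty} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \nabla_w^\top \log p(s_{1:t}, a_{1:t}; w) \, R(s_t, a_t) \\
&\quad = \sum_{t=1}^{\infty} \sum_{s_{1:t}} \sum_{a_{1:t}} \sum_{\tau=1}^{t} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \nabla_w^\top \log \pi(a_\tau | s_\tau; w) \, R(s_t, a_t) \\
&\quad = \sum_{\tau=1}^{\infty} \sum_{s_\tau, a_\tau} \gamma^{\tau-1} p(s_\tau, a_\tau; w) \, \nabla_w \nabla_w^\top \log \pi(a_\tau | s_\tau; w) \sum_{t=\tau}^{\infty} \sum_{s_t, a_t} \gamma^{t-\tau} p(s_t, a_t | s_\tau, a_\tau; w) \, R(s_t, a_t) \\
&\quad = \sum_{\tau=1}^{\infty} \sum_{s_\tau, a_\tau} \gamma^{\tau-1} p(s_\tau, a_\tau; w) \, \nabla_w \nabla_w^\top \log \pi(a_\tau | s_\tau; w) \, Q(s_\tau, a_\tau; w) \\
&\quad = \mathcal{H}_2(w),
\end{aligned}$$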
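where in the third line we swapped the order of summation.

Using (40) we can write the second term in (42) as,

$$\begin{aligned}
&\sum_{t=1}^{\infty} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \log p(s_{1:t}, a_{1:t}; w) \, \nabla_w^\top \log p(s_{1:t}, a_{1:t}; w) \, R(s_t, a_t) \\
&\quad = \sum_{t=1}^{\infty} \sum_{\tau=1}^{t} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \log \pi(a_\tau | s_\tau; w) \, \nabla_w^\top \log \pi(a_\tau | s_\tau; w) \, R(s_t, a_t) \\
&\qquad + \sum_{t=1}^{\infty} \sum_{\substack{\tau_1, \tau_2 = 1 \\ \tau_1 \neq \tau_2}}^{t} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \log \pi(a_{\tau_1} | s_{\tau_1}; w) \, \nabla_w^\top \log \pi(a_{\tau_2} | s_{\tau_2}; w) \, R(s_t, a_t).
\end{aligned} \tag{43}$$

By swapping the order of summation and following analogous calculations to those above,
it can be shown that the first term in (43) is equal to $\mathcal{H}_1(w)$ as defined in (18). It remains
to show that the second term in (43) is given by $\mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$, with $\mathcal{H}_{12}(w)$ as given
in (20). Splitting the second term in (43) into two terms,

$$\begin{aligned}
&\sum_{t=1}^{\infty} \sum_{\substack{\tau_1, \tau_2 = 1 \\ \tau_1 \neq \tau_2}}^{t} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \log \pi(a_{\tau_1} | s_{\tau_1}; w) \, \nabla_w^\top \log \pi(a_{\tau_2} | s_{\tau_2}; w) \, R(s_t, a_t) \\
&\quad = \sum_{t=1}^{\infty} \sum_{\tau_2=1}^{t} \sum_{\tau_1=1}^{\tau_2 - 1} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \log \pi(a_{\tau_1} | s_{\tau_1}; w) \, \nabla_w^\top \log \pi(a_{\tau_2} | s_{\tau_2}; w) \, R(s_t, a_t) \\
&\qquad + \sum_{t=1}^{\infty} \sum_{\tau_1=1}^{t} \sum_{\tau_2=1}^{\tau_1 - 1} \sum_{s_{1:t}} \sum_{a_{1:t}} \gamma^{t-1} p(s_{1:t}, a_{1:t}; w) \, \nabla_w \log \pi(a_{\tau_1} | s_{\tau_1}; w) \, \nabla_w^\top \log \pi(a_{\tau_2} | s_{\tau_2}; w) \, R(s_t, a_t),
\end{aligned} \tag{44}$$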


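we will show that the first term is equal to $\mathcal{H}_{12}(w)$. Given this, it immediately follows that
the second term is equal to $\mathcal{H}_{12}^\top(w)$. Using the Markov property of the transition dynamics
and the policy it follows that the first term in (44) is given by

$$\begin{aligned}
&\sum_{t=1}^{\infty} \sum_{\tau_2=1}^{t} \sum_{\tau_1=1}^{\tau_2 - 1} \sum_{s_{\tau_1}, a_{\tau_1}} \gamma^{\tau_1 - 1} p(s_{\tau_1}, a_{\tau_1}; w) \, \nabla_w \log \pi(a_{\tau_1} | s_{\tau_1}; w) \\
&\quad \times \sum_{s_{\tau_2}, a_{\tau_2}} \gamma^{\tau_2 - \tau_1} p(s_{\tau_2}, a_{\tau_2} | s_{\tau_1}, a_{\tau_1}; w) \, \nabla_w^\top \log \pi(a_{\tau_2} | s_{\tau_2}; w) \sum_{s_t, a_t} \gamma^{t - \tau_2} p(s_t, a_t | s_{\tau_2}, a_{\tau_2}; w) \, R(s_t, a_t).
\end{aligned}$$

Rearranging the summation over $t$, $\tau_1$ and $\tau_2$ this can be rewritten in the form,

$$\begin{aligned}
&\sum_{\tau_1=1}^{\infty} \sum_{s_{\tau_1}, a_{\tau_1}} \gamma^{\tau_1 - 1} p(s_{\tau_1}, a_{\tau_1}; w) \, \nabla_w \log \pi(a_{\tau_1} | s_{\tau_1}; w) \\
&\qquad \times \Bigg\{ \sum_{\tau_2 = \tau_1 + 1}^{\infty} \sum_{s_{\tau_2}, a_{\tau_2}} \gamma^{\tau_2 - \tau_1} p(s_{\tau_2}, a_{\tau_2} | s_{\tau_1}, a_{\tau_1}; w) \, \nabla_w^\top \log \pi(a_{\tau_2} | s_{\tau_2}; w) \sum_{t = \tau_2}^{\infty} \sum_{s_t, a_t} \gamma^{t - \tau_2} p(s_t, a_t | s_{\tau_2}, a_{\tau_2}; w) \, R(s_t, a_t) \Bigg\} \\
&\quad = \sum_{\tau_1=1}^{\infty} \sum_{s_{\tau_1}, a_{\tau_1}} \gamma^{\tau_1 - 1} p(s_{\tau_1}, a_{\tau_1}; w) \, \nabla_w \log \pi(a_{\tau_1} | s_{\tau_1}; w) \\
&\qquad \times \sum_{\tau_2 = \tau_1 + 1}^{\infty} \sum_{s_{\tau_2}, a_{\tau_2}} \gamma^{\tau_2 - \tau_1} p(s_{\tau_2}, a_{\tau_2} | s_{\tau_1}, a_{\tau_1}; w) \, \nabla_w^\top \log \pi(a_{\tau_2} | s_{\tau_2}; w) \, Q(s_{\tau_2}, a_{\tau_2}; w) \\
&\quad = \sum_{\tau_1=1}^{\infty} \sum_{s_{\tau_1}, a_{\tau_1}} \gamma^{\tau_1 - 1} p(s_{\tau_1}, a_{\tau_1}; w) \, \nabla_w \log \pi(a_{\tau_1} | s_{\tau_1}; w) \, \nabla_w^\top Q(s_{\tau_1}, a_{\tau_1}; w) \\
&\quad = \mathcal{H}_{12}(w),
\end{aligned}$$

where the penultimate line follows from (38). This completes the proof.

□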


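A.2 Proof of Theorem 4

Recalling that the state-action value function takes the form, $Q(s, a; w) = V(s; w) +
A(s, a; w)$, the matrices $\mathcal{H}_1(w)$ and $\mathcal{H}_2(w)$ can be written in the following forms,

$$\mathcal{H}_1(w) = \mathcal{A}_1(w) + \mathcal{V}_1(w), \qquad \mathcal{H}_2(w) = \mathcal{A}_2(w) + \mathcal{V}_2(w), \tag{45}$$

where,

$$\begin{aligned}
\mathcal{A}_1(w) &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, A(s, a; w) \, \nabla_w \log \pi(a | s; w) \, \nabla_w^\top \log \pi(a | s; w), \\
\mathcal{A}_2(w) &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, A(s, a; w) \, \nabla_w \nabla_w^\top \log \pi(a | s; w), \\
\mathcal{V}_1(w) &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, V(s; w) \, \nabla_w \log \pi(a | s; w) \, \nabla_w^\top \log \pi(a | s; w), \\
\mathcal{V}_2(w) &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, V(s; w) \, \nabla_w \nabla_w^\top \log \pi(a | s; w).
\end{aligned}$$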

We begin with the following auxiliary lemmas. 

Lemma 2. Suppose we are given a Markov decision process with objective $U(w)$ and a
Markovian trajectory distribution. Provided that the policy satisfies the Fisher regularity
conditions, then for any given parameter vector, $w \in \mathcal{W}$, the matrices $\mathcal{V}_1(w)$ and $\mathcal{V}_2(w)$ satisfy
the following relation

$$\mathcal{V}_1(w) = -\mathcal{V}_2(w). \tag{46}$$

Proof. As the policy satisfies the Fisher regularity conditions, then for any state, $s \in \mathcal{S}$, the
following relation holds

$$\sum_{a \in \mathcal{A}} \pi(a | s; w) \, \nabla_w \log \pi(a | s; w) \, \nabla_w^\top \log \pi(a | s; w) = -\sum_{a \in \mathcal{A}} \pi(a | s; w) \, \nabla_w \nabla_w^\top \log \pi(a | s; w).$$

This means that $\mathcal{V}_1(w)$ can be written in the form

$$\begin{aligned}
\mathcal{V}_1(w) &= \sum_{s \in \mathcal{S}} p_\gamma(s; w) \, V(s; w) \sum_{a \in \mathcal{A}} \pi(a | s; w) \, \nabla_w \log \pi(a | s; w) \, \nabla_w^\top \log \pi(a | s; w) \\
&= -\sum_{s \in \mathcal{S}} p_\gamma(s; w) \, V(s; w) \sum_{a \in \mathcal{A}} \pi(a | s; w) \, \nabla_w \nabla_w^\top \log \pi(a | s; w) = -\mathcal{V}_2(w),
\end{aligned}$$

which completes the proof. □
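
The Fisher regularity relation used in this proof can be checked directly for a softmax policy over a single state. The sketch below is illustrative only: the softmax parametrization and the names `expected_score_outer` and `log_policy_hessian` are assumptions, not part of the paper. For this policy the Hessian of the log-policy is the same for every action, so it also gives an instance of constant curvature with respect to the action space.

```python
import math

def softmax(theta):
    m = max(theta)
    z = [math.exp(t - m) for t in theta]
    total = sum(z)
    return [x / total for x in z]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def expected_score_outer(theta):
    """sum_a pi(a) * grad log pi(a) * grad log pi(a)^T for softmax logits theta."""
    pi = softmax(theta)
    n = len(theta)
    total = [[0.0] * n for _ in range(n)]
    for a in range(n):
        # grad log pi(a) with respect to the logits is e_a - pi.
        score = [(1.0 if i == a else 0.0) - pi[i] for i in range(n)]
        block = outer(score, score)
        for i in range(n):
            for j in range(n):
                total[i][j] += pi[a] * block[i][j]
    return total

def log_policy_hessian(theta):
    """Hessian of log pi(a) w.r.t. the logits: -(diag(pi) - pi pi^T), for every action a."""
    pi = softmax(theta)
    n = len(theta)
    return [[-((pi[i] if i == j else 0.0) - pi[i] * pi[j]) for j in range(n)]
            for i in range(n)]

theta = [0.5, -1.0, 2.0]
lhs = expected_score_outer(theta)
# The Hessian does not depend on a, so sum_a pi(a) * Hessian equals the Hessian itself.
hess = log_policy_hessian(theta)
gap = max(abs(lhs[i][j] + hess[i][j]) for i in range(3) for j in range(3))
print(gap)
```

The printed gap should be at the level of floating-point rounding, reflecting that the expected outer product of the score equals minus the expected Hessian of the log-policy.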

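Lemma 3. Suppose we are given a Markov decision process with objective $U(w)$ and a
Markovian trajectory distribution. If the policy parametrization has constant curvature with
respect to the action space, then

$$\mathcal{A}_2(w) = 0. \tag{47}$$

Proof. Recalling the definition of constant curvature with respect to the action space, under
which $\nabla_w \nabla_w^\top \log \pi(a | s; w)$ does not depend on $a$ and is written $\nabla_w \nabla_w^\top \log \pi(s; w)$,
the matrix $\mathcal{A}_2(w)$ takes the form,

$$\begin{aligned}
\mathcal{A}_2(w) &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, A(s, a; w) \, \nabla_w \nabla_w^\top \log \pi(s; w) \\
&= \sum_{s \in \mathcal{S}} p_\gamma(s; w) \, \nabla_w \nabla_w^\top \log \pi(s; w) \sum_{a \in \mathcal{A}} \pi(a | s; w) \, A(s, a; w).
\end{aligned}$$

The relation $\mathcal{A}_2(w) = 0$ follows because $\sum_{a \in \mathcal{A}} \pi(a | s; w) A(s, a; w) = 0$, for all $s \in \mathcal{S}$. □

Lemmas 2 and 3, along with the relation (45), directly imply the result of Theorem 4.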


A.3 Proof of Theorem 5 and Definiteness Results

Theorem 5. Proof. The first result follows from the fact that when the policy is log-concave
with respect to the policy parameters, then $\mathcal{H}_2(w)$ is a non-negative mixture of
negative-definite matrices, which again is negative-definite (Boyd and Vandenberghe, 2004).

The second result follows because when the policy parametrization has constant curvature
with respect to the action space, then by Lemma 3 in Section A.2, $\mathcal{A}_2(w) = 0$, so
that

$$\mathcal{H}_2(w) = \mathcal{A}_2(w) + \mathcal{V}_2(w) = \mathcal{V}_2(w) = -\mathcal{V}_1(w),$$

with

$$\begin{aligned}
\mathcal{V}_1(w) &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, V(s; w) \, \nabla_w \log \pi(a | s; w) \, \nabla_w^\top \log \pi(a | s; w), \\
\mathcal{V}_2(w) &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, V(s; w) \, \nabla_w \nabla_w^\top \log \pi(a | s; w).
\end{aligned}$$

The result now follows because $-\mathcal{V}_1(w)$ is negative-definite for all $w \in \mathcal{W}$. □

Lemma 4. For any $w \in \mathcal{W}$ the matrix,

$$\mathcal{H}_{11}(w) = \mathcal{H}_1(w) + \mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$$

is positive-definite.

Proof. This follows immediately from the form of $\mathcal{H}_1(w) + \mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w)$ given by (43)
in Theorem 3, which is positive-definite since the reward function is assumed positive. □


A.4 Proof of Theorem 6

We first prove an auxiliary lemma about the gradient of the value function in the case of
a tabular policy. As we are considering a tabular policy we have a separate parameter
vector $w_s$ for each state $s \in \mathcal{S}$. We denote the parameter vector of the entire policy by
$w$, which is given by the concatenation of the parameter vectors of the different
states. The dimension of $w$ is given by $n = \sum_{s \in \mathcal{S}} n_s$. In order to show that tabular policies
are value consistent we start by relating the gradient of $V(s; w)$ to the gradient of $V(\hat{s}; w)$,
where the gradient is taken with respect to the policy parameters of state $\hat{s}$, while the policy
parameters of the remaining states are held fixed.

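Lemma 5. Suppose we are given a Markov decision process with a tabular policy such that
$V(s; w)$ is differentiable for each $s \in \mathcal{S}$. Given $s, \hat{s} \in \mathcal{S}$, such that $s \neq \hat{s}$, then we have that

$$\nabla_{w_{\hat{s}}} V(s; w) = p_{\mathrm{hit}}(s \to \hat{s}) \, \nabla_{w_{\hat{s}}} V(\hat{s}; w), \tag{48}$$

where the notation $\nabla_{w_{\hat{s}}} V(s; w)$ is used to denote the gradient of the value function w.r.t.
the policy parameter of state $\hat{s}$, with the policy parameters of all other states considered
fixed. The term $p_{\mathrm{hit}}(s \to \hat{s})$ in (48) is given by

$$p_{\mathrm{hit}}(s \to \hat{s}) = \sum_{t=2}^{\infty} \gamma^{t-1} p(s_t = \hat{s} \,|\, s_1 = s, \, s_\tau \neq \hat{s}, \, \tau = 1, \ldots, t - 1; w).$$

Furthermore, when the Markov chain induced by the policy parameters is ergodic then $p_{\mathrm{hit}}(s \to \hat{s}) > 0$.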


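Proof. Given the equality

$$V(s; w) = \sum_{a \in \mathcal{A}} \pi(a | s; w) \, Q(s, a; w),$$

we have that

$$\nabla_{w_{\hat{s}}} V(s; w) = \sum_{a \in \mathcal{A}} \Big( \nabla_{w_{\hat{s}}} \pi(a | s; w) \, Q(s, a; w) + \pi(a | s; w) \, \nabla_{w_{\hat{s}}} Q(s, a; w) \Big).$$

As the policy is tabular and $s \neq \hat{s}$ we have that $\nabla_{w_{\hat{s}}} \pi(a | s; w) = 0$, so that this simplifies to

$$\nabla_{w_{\hat{s}}} V(s; w) = \sum_{a \in \mathcal{A}} \pi(a | s; w) \, \nabla_{w_{\hat{s}}} Q(s, a; w).$$

Using the fact that $Q(s, a; w) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' | s, a) V(s'; w)$ we have

$$\begin{aligned}
\nabla_{w_{\hat{s}}} V(s; w) &= \gamma \sum_{s' \in \mathcal{S}} p(s' | s; w) \, \nabla_{w_{\hat{s}}} V(s'; w) \\
&= \gamma \, p(\hat{s} | s; w) \, \nabla_{w_{\hat{s}}} V(\hat{s}; w) + \gamma \sum_{\substack{s' \in \mathcal{S} \\ s' \neq \hat{s}}} p(s' | s; w) \, \nabla_{w_{\hat{s}}} V(s'; w).
\end{aligned} \tag{49}$$

Applying equation (49) recursively gives

$$\begin{aligned}
\nabla_{w_{\hat{s}}} V(s; w) &= \sum_{t=2}^{\infty} \gamma^{t-1} p(s_t = \hat{s} \,|\, s_1 = s, \, s_\tau \neq \hat{s}, \, \tau = 1, \ldots, t - 1; w) \, \nabla_{w_{\hat{s}}} V(\hat{s}; w) \\
&= p_{\mathrm{hit}}(s \to \hat{s}) \, \nabla_{w_{\hat{s}}} V(\hat{s}; w),
\end{aligned} \tag{50}$$

which completes the proof. The probability, $p(s_t = \hat{s} \,|\, s_1 = s, s_\tau \neq \hat{s}, \tau = 1, \ldots, t - 1; w)$, is
equivalent to the probability that the first hitting time (of hitting state $\hat{s}$ when starting in
state $s$) is equal to $t$. The strict inequality, $p_{\mathrm{hit}}(s \to \hat{s}) > 0$, follows from the ergodicity of
the Markov chain induced by $w$. □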

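We are now ready to prove Theorem 6.

Theorem 6. Proof. Suppose that there exists $i \in \mathbb{N}_n$ and $w \in \mathcal{W}$ such that

$$e_i^\top \nabla_w V(s; w) \neq 0,$$

for some $s \in \mathcal{S}$. As the policy parametrization is tabular, then the $i$-th component of $w$
corresponds to a policy parameter for a particular state, $\hat{s} \in \mathcal{S}$. From Lemma 5 it follows
that

$$\frac{\partial}{\partial w_i} V(s; w) = p_{\mathrm{hit}}(s \to \hat{s}) \, \frac{\partial}{\partial w_i} V(\hat{s}; w),$$

for all $s \in \mathcal{S}$. It follows that for states, $s \in \mathcal{S}$, for which $p_{\mathrm{hit}}(s \to \hat{s}) > 0$ we have

$$\operatorname{sign}\big(e_i^\top \nabla_w V(s; w)\big) = \operatorname{sign}\big(e_i^\top \nabla_w V(\hat{s}; w)\big),$$

while in states for which $p_{\mathrm{hit}}(s \to \hat{s}) = 0$ we have

$$\operatorname{sign}\big(e_i^\top \nabla_w V(s; w)\big) = 0.$$

It remains to show that for states in which $p_{\mathrm{hit}}(s \to \hat{s}) = 0$ that

$$\operatorname{sign}\big(e_i^\top \nabla_w \pi(a | s; w)\big) = 0, \quad \forall a \in \mathcal{A}.$$

This property follows immediately from the fact that the policy parametrization is tabular
and $p_{\mathrm{hit}}(\hat{s} \to \hat{s}) \neq 0$. □

A.5 Proof of Theorem 7

Theorem 7. Proof. In order to obtain a contradiction suppose that $w^*$ is not a stationary
point of $V(s; w)$, for each $s \in \mathcal{S}$. This means that there exists $i \in \mathbb{N}_n$ and $\hat{s} \in \mathcal{S}$ such that
$e_i^\top \nabla_w|_{w=w^*} V(\hat{s}; w) \neq 0$. We suppose that $e_i^\top \nabla_w|_{w=w^*} V(\hat{s}; w) > 0$ (an identical argument
can be used for the case $e_i^\top \nabla_w|_{w=w^*} V(\hat{s}; w) < 0$). As the policy parametrization is value
consistent it follows that, for each $s \in \mathcal{S}$,

$$e_i^\top \nabla_w|_{w=w^*} V(s; w) \geq 0. \tag{51}$$

In order to obtain a contradiction we will show that there is no $s \in \mathcal{S}$ for which (51)
holds with equality. Given this property a contradiction is obtained because it follows that

$$e_i^\top \nabla_w|_{w=w^*} U(w) = \sum_{s \in \mathcal{S}} p_1(s) \, e_i^\top \nabla_w|_{w=w^*} V(s; w) > 0,$$

contradicting the fact that $w^*$ is a local optimum of the objective function. Introducing the
notation

$$\mathcal{S}_{=} = \big\{ s \in \mathcal{S} \,\big|\, e_i^\top \nabla_w|_{w=w^*} V(s; w) = 0 \big\}, \qquad \mathcal{S}_{>} = \big\{ s \in \mathcal{S} \,\big|\, e_i^\top \nabla_w|_{w=w^*} V(s; w) > 0 \big\},$$

we wish to show that $\mathcal{S}_{=} = \emptyset$. In particular, for a contradiction, suppose that $\mathcal{S}_{=} \neq \emptyset$. This
means, given the ergodicity of the Markov chain induced by $w^*$ and the fact that $\mathcal{S}_{>} \neq \emptyset$,
that there exists $s \in \mathcal{S}_{=}$ and $s' \in \mathcal{S}_{>}$ such that

$$p(s' | s; w^*) = \sum_{a \in \mathcal{A}} p(s' | s, a) \, \pi(a | s; w^*) > 0.$$

We now consider the form of $e_i^\top \nabla_w|_{w=w^*} V(s; w)$. In particular, we have

$$\begin{aligned}
\nabla_w V(s; w) &= \sum_{a \in \mathcal{A}} \nabla_w \pi(a | s; w) \Big( R(a, s) + \gamma \sum_{s_{\mathrm{next}} \in \mathcal{S}} p(s_{\mathrm{next}} | s, a) \, V(s_{\mathrm{next}}; w) \Big) \\
&\quad + \gamma \sum_{a \in \mathcal{A}} \pi(a | s; w) \sum_{s_{\mathrm{next}} \in \mathcal{S}} p(s_{\mathrm{next}} | s, a) \, \nabla_w V(s_{\mathrm{next}}; w).
\end{aligned}$$

As $s \in \mathcal{S}_{=}$, we have by value consistency that

$$e_i^\top \nabla_w|_{w=w^*} \pi(a | s; w) = 0, \quad \forall a \in \mathcal{A}.$$

This means that

$$e_i^\top \nabla_w|_{w=w^*} V(s; w) = \gamma \sum_{a \in \mathcal{A}} \pi(a | s; w^*) \sum_{s_{\mathrm{next}} \in \mathcal{S}} p(s_{\mathrm{next}} | s, a) \, e_i^\top \nabla_w|_{w=w^*} V(s_{\mathrm{next}}; w) > 0.$$

The inequality follows from the fact that $p(s' | s; w^*) > 0$, for some $s' \in \mathcal{S}_{>}$. This is a
contradiction of the fact that $s \in \mathcal{S}_{=}$, so it follows that $\mathcal{S}_{=} = \emptyset$ and for all $s \in \mathcal{S}$ we have

$$e_i^\top \nabla_w|_{w=w^*} V(s; w) > 0,$$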

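which completes the proof. □

A.6 Proof of Theorem 8

Theorem 8. Proof. We shall consider the second Gauss-Newton method, with the result
for the diagonal Gauss-Newton method following similarly. Given a non-singular affine
transformation, $T \in \mathbb{R}^{n \times n}$, define the objective, $\tilde{U}(w) = U(Tw) = U(v)$, with $v = Tw$,
and denote the approximate Hessian of $\tilde{U}(w)$ by $\tilde{\mathcal{H}}_2(w)$. Given $w \in \mathcal{W}$, then it is sufficient
to show that,

$$T w_{\mathrm{new}} = T\big(w - \alpha \tilde{\mathcal{H}}_2^{-1}(w) \nabla_w \tilde{U}(w)\big) = v - \alpha \mathcal{H}_2^{-1}(v) \nabla_v U(v) = v_{\mathrm{new}}, \quad \forall \alpha \in \mathbb{R}^{+}.$$

Following calculations analogous to those in Section A.1 it can be shown that,

$$\begin{aligned}
\nabla_w \tilde{U}(w) &= \sum_{s,a} p_\gamma(s, a; Tw) \, Q(s, a; Tw) \, \nabla_w \log \pi(a | s; Tw), \\
\tilde{\mathcal{H}}_2(w) &= \sum_{s,a} p_\gamma(s, a; Tw) \, Q(s, a; Tw) \, \nabla_w \nabla_w^\top \log \pi(a | s; Tw).
\end{aligned}$$

Using the relations

$$\begin{aligned}
\nabla_w \log \pi(a | s; Tw) &= T^\top \nabla_v \log \pi(a | s; v), \\
\nabla_w \nabla_w^\top \log \pi(a | s; Tw) &= T^\top \nabla_v \nabla_v^\top \log \pi(a | s; v) \, T,
\end{aligned}$$

it follows that

$$\nabla_w \tilde{U}(w) = T^\top \nabla_v U(v), \qquad \tilde{\mathcal{H}}_2(w) = T^\top \mathcal{H}_2(v) \, T.$$

From this we have, for any $\alpha \in \mathbb{R}^{+}$, that

$$T w_{\mathrm{new}} = T\big(w - \alpha \tilde{\mathcal{H}}_2^{-1}(w) \nabla_w \tilde{U}(w)\big) = v - \alpha \mathcal{H}_2^{-1}(v) \nabla_v U(v) = v_{\mathrm{new}},$$

which completes the proof.

□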
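
The invariance argument can be illustrated numerically in two dimensions. In the sketch below the matrix `H_v`, the vector `g_v`, the map `T`, the point `w` and the step size `alpha` are arbitrary stand-ins (assumptions for illustration), and the helper names are hypothetical; only the transformation rules $T^\top g$ and $T^\top \mathcal{H} T$ are taken from the proof.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def solve2(A, b):
    """Solve a 2x2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def gauss_newton_step(point, grad, approx_hessian, alpha):
    """Return point - alpha * H^{-1} grad."""
    direction = solve2(approx_hessian, grad)
    return [p - alpha * d for p, d in zip(point, direction)]

# Hypothetical quantities in the v-parametrization.
H_v = [[-2.0, 0.5], [0.5, -1.0]]   # negative-definite stand-in for H_2(v)
g_v = [1.0, -3.0]                  # stand-in for the gradient of U at v
T = [[2.0, 1.0], [0.0, 0.5]]       # non-singular linear map, v = T w
w = [0.4, -0.7]
alpha = 0.3

v = matvec(T, w)
Tt = transpose(T)
g_w = matvec(Tt, g_v)                # gradient transforms as T^T g
H_w = matmul(matmul(Tt, H_v), T)     # approximate Hessian transforms as T^T H T

w_new = gauss_newton_step(w, g_w, H_w, alpha)
v_new = gauss_newton_step(v, g_v, H_v, alpha)
mapped = matvec(T, w_new)
gap = max(abs(a - b) for a, b in zip(mapped, v_new))
print(gap)
```

Mapping the update computed in the $w$-parametrization through `T` should reproduce the update computed directly in the $v$-parametrization, up to rounding.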


A.7 Proofs of Theorems 9 and 10

We begin by stating a well-known tool for analysis of convergence of iterative optimization
methods. Given an iterative optimization method, defined through a mapping $G : \mathcal{W} \to \mathbb{R}^n$,
where $\mathcal{W} \subseteq \mathbb{R}^n$, the local convergence at a point $w^* \in \mathcal{W}$ is determined by the spectral
radius of the Jacobian of $G$ at $w^*$, $\nabla_w|_{w=w^*} G(w)$. This is formalized through the well-known
Ostrowski’s Theorem, a formal proof of which can be found in Ortega and Rheinboldt
(1970).

Lemma 6 (Ostrowski’s Theorem). Suppose that we have a mapping $G : \mathcal{W} \to \mathbb{R}^n$, where
$\mathcal{W} \subseteq \mathbb{R}^n$, such that $w^* \in \operatorname{int}(\mathcal{W})$ is a fixed-point of $G$ and, furthermore, $G$ is Fréchet
differentiable at $w^*$. If the spectral radius of $\nabla G(w^*)$ satisfies $\rho(\nabla G(w^*)) < 1$, then $w^*$ is
a point of attraction of $G$. Furthermore, if $\rho(\nabla G(w^*)) > 0$, then the convergence towards
$w^*$ is linear and the rate is given by $\rho(\nabla G(w^*))$.

We now prove Theorems 9 and 10.

Theorem 9 (Convergence analysis for the first Gauss-Newton method). Proof. A formal proof
that $G_1$ is Fréchet differentiable can be found in Section 10.2.1 of Ortega and Rheinboldt
(1970). We now demonstrate the form of $\nabla G_1(w^*)$. For simplicity we shall assume that
$(\mathcal{A}_1(w) + \mathcal{A}_2(w))^{-1}$ is differentiable at $w^*$. This is not a necessary condition, and a proof that
does not make this assumption can be found in Section 10.2.1 of Ortega and Rheinboldt
(1970). We have that,

$$G_1(w) = w - \alpha \big(\mathcal{A}_1(w) + \mathcal{A}_2(w)\big)^{-1} \nabla_w U(w),$$

so that $\nabla_w G_1(w)$ is given by

$$\nabla_w G_1(w) = I - \alpha \nabla_w \big(\mathcal{A}_1(w) + \mathcal{A}_2(w)\big)^{-1} \nabla_w U(w) - \alpha \big(\mathcal{A}_1(w) + \mathcal{A}_2(w)\big)^{-1} \mathcal{H}(w).$$

The fact that $\nabla_w|_{w=w^*} U(w) = 0$ means that

$$\nabla G_1(w^*) = I - \alpha \big(\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*)\big)^{-1} \mathcal{H}(w^*).$$

As $\mathcal{H}(w^*)$ and $\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*)$ are negative-definite, it follows that the eigenvalues of
$(\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*))^{-1} \mathcal{H}(w^*)$ are positive. Hence,

$$\rho(\nabla G_1(w^*)) = \max\{ |1 - \alpha \lambda_{\min}|, \, |1 - \alpha \lambda_{\max}| \}, \tag{52}$$

with $\lambda_{\min}$ and $\lambda_{\max}$ respectively denoting the minimal and maximal eigenvalues of
$(\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*))^{-1} \mathcal{H}(w^*)$. Hence, $\rho(\nabla G_1(w^*)) < 1$ provided that $\alpha \in (0, 2/\lambda_{\max})$, or, written in
terms of the spectral radius, $\alpha \in \big(0, 2/\rho((\mathcal{A}_1(w^*) + \mathcal{A}_2(w^*))^{-1} \mathcal{H}(w^*))\big)$.

When the policy parametrization is value consistent with respect to the given MDP,
then from Theorem 7, $\mathcal{H}_{12}(w^*) + \mathcal{H}_{12}^\top(w^*) = 0$, so that $\mathcal{H}(w^*) = \mathcal{A}_1(w^*) + \mathcal{A}_2(w^*)$. It then
follows that $\nabla G_1(w^*) = (1 - \alpha) I$. Convergence for this case follows in the same manner. □
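
Expression (52) and the resulting step-size interval are simple to evaluate once the eigenvalues are known. The following sketch uses hypothetical eigenvalues (`eigs`) and hypothetical helper names; it only illustrates how the bound $2/\lambda_{\max}$ separates contracting from non-contracting step sizes.

```python
def spectral_radius(alpha, eigenvalues):
    """rho(I - alpha * M) as in (52), given the positive real eigenvalues of M."""
    lam_min, lam_max = min(eigenvalues), max(eigenvalues)
    return max(abs(1.0 - alpha * lam_min), abs(1.0 - alpha * lam_max))

def max_step_size(eigenvalues):
    """Upper end of the interval (0, 2 / lambda_max) on which rho < 1."""
    return 2.0 / max(eigenvalues)

eigs = [0.25, 0.8, 1.6]      # hypothetical eigenvalues, for illustration only
bound = max_step_size(eigs)

for alpha in (0.5, 1.0, 1.2, 1.3):
    rho = spectral_radius(alpha, eigs)
    print(alpha, rho, "contracting" if rho < 1.0 else "not contracting")
```

For these illustrative eigenvalues the bound is 1.25, so the first three step sizes give a spectral radius below one and the last does not.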


Theorem 10 (Convergence analysis for the second Gauss-Newton method). Proof. The formulas
(32) and (34) follow as in the proof of Theorem 9. Using the same approach as in
Theorem 9 it can be shown that $\rho(\nabla G_2(w^*)) < 1$ provided that $\alpha \in \big(0, 2/\rho(\mathcal{H}_2(w^*)^{-1} \mathcal{H}(w^*))\big)$.

As $\mathcal{H}(w^*)$ and $\mathcal{H}_2(w^*)$ are negative-definite the eigenvalues of $\mathcal{H}_2(w^*)^{-1} \mathcal{H}(w^*)$ are
positive. Furthermore, as $\mathcal{H}(w^*) = \mathcal{H}_{11}(w^*) + \mathcal{H}_2(w^*)$, and, by Lemma 4, $\mathcal{H}_{11}(w^*)$ is
positive-definite, it follows that the eigenvalues of $\mathcal{H}_2(w^*)^{-1} \mathcal{H}(w^*)$ all lie in the range
$(0, 1)$. This means that $\alpha \in (0, 2)$ is sufficient to ensure that $\rho(\nabla G_2(w^*)) < 1$.

□
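
The step from the definiteness properties to the range $(0, 1)$ can be spelled out as follows. This is a sketch of one standard argument, not necessarily the derivation the authors had in mind; the eigenpair $(\lambda, x)$ is notation introduced here. The matrix $\mathcal{H}_2(w^*)^{-1}\mathcal{H}(w^*)$ is similar to the symmetric positive-definite matrix $(-\mathcal{H}_2)^{-1/2}(-\mathcal{H})(-\mathcal{H}_2)^{-1/2}$, so its eigenvalues are real and its eigenvectors can be taken real.

```latex
% Let (\lambda, x), x \neq 0, satisfy \mathcal{H}_2^{-1}\mathcal{H}\,x = \lambda x at w = w^*.
\begin{aligned}
\mathcal{H} x = \lambda \mathcal{H}_2 x
\;\;&\Longrightarrow\;\;
\lambda = \frac{x^\top \mathcal{H} x}{x^\top \mathcal{H}_2 x}
        = \frac{x^\top (\mathcal{H}_{11} + \mathcal{H}_2) x}{x^\top \mathcal{H}_2 x}
        = 1 + \frac{x^\top \mathcal{H}_{11} x}{x^\top \mathcal{H}_2 x}. \\
x^\top \mathcal{H} x < 0,\; x^\top \mathcal{H}_2 x < 0
\;\;&\Longrightarrow\;\; \lambda > 0, \\
x^\top \mathcal{H}_{11} x > 0,\; x^\top \mathcal{H}_2 x < 0
\;\;&\Longrightarrow\;\; \lambda < 1 .
\end{aligned}
```

With every eigenvalue in $(0, 1)$, expression (52) applied to $\nabla G_2(w^*)$ is below one for any $\alpha \in (0, 2)$.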


A.8 Proof of Theorem 11

Theorem 11. Proof. We use the notation $\nabla^{10} \mathcal{Q}(w_j, w_k)$ to denote the derivative with
respect to the first variable of $\mathcal{Q}$, evaluated at $(w_j, w_k)$, and similarly $\nabla^{20} \mathcal{Q}(w_j, w_k)$ for the
second derivative and $\nabla^{01} \mathcal{Q}(w_j, w_k)$ for the derivative with respect to the second variable
etc. The idea of the proof is simple and consists of performing a Taylor expansion of
$\nabla^{10} \mathcal{Q}(w, w_k)$. As $\mathcal{Q}$ is assumed to be twice continuously differentiable in the first component
this Taylor expansion is possible and gives

$$\nabla^{10} \mathcal{Q}(w_{k+1}, w_k) = \nabla^{10} \mathcal{Q}(w_k, w_k) + \nabla^{20} \mathcal{Q}(w_k, w_k)(w_{k+1} - w_k) + O(\|w_{k+1} - w_k\|^2). \tag{53}$$

As $w_{k+1} = \operatorname{argmax}_{w \in \mathcal{W}} \mathcal{Q}(w, w_k)$ it follows that $\nabla^{10} \mathcal{Q}(w_{k+1}, w_k) = 0$. This means that,
upon ignoring higher order terms in $w_{k+1} - w_k$, the Taylor expansion (53) can be rewritten
into the form

$$w_{k+1} - w_k = -\nabla^{20} \mathcal{Q}(w_k, w_k)^{-1} \nabla^{10} \mathcal{Q}(w_k, w_k). \tag{54}$$

The proof is completed by observing that

$$\nabla^{10} \mathcal{Q}(w_k, w_k) = \nabla_w|_{w=w_k} U(w), \qquad \nabla^{20} \mathcal{Q}(w_k, w_k) = \mathcal{H}_2(w_k).$$

The second statement follows because in the case where the log-policy is quadratic the
higher order terms in the Taylor expansion vanish. □
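
The second statement can be illustrated with a one-dimensional Gaussian policy with mean $w$ and unit variance, for which the log-policy is quadratic in $w$. The sketch below is an assumption-laden toy: the `actions` and `weights` are made up, with the weights standing in for the positive quantities $p_\gamma(s, a; w_k) Q(s, a; w_k)$, and the function names are hypothetical.

```python
def em_update(actions, weights):
    """argmax_w of sum_i c_i * log N(a_i; w, 1): the weighted mean of the actions."""
    return sum(c * a for c, a in zip(weights, actions)) / sum(weights)

def gauss_newton_update(w, actions, weights, alpha=1.0):
    """w - alpha * (second derivative)^{-1} * (first derivative) of the same objective."""
    grad = sum(c * (a - w) for c, a in zip(weights, actions))   # first derivative in w
    hess = -sum(weights)                                        # second derivative in w
    return w - alpha * grad / hess

actions = [0.2, 1.5, -0.4, 0.9]
weights = [1.0, 3.0, 0.5, 2.0]   # stand-ins for p_gamma(s, a; w_k) * Q(s, a; w_k) > 0
w_k = 5.0

w_em = em_update(actions, weights)
w_gn = gauss_newton_update(w_k, actions, weights, alpha=1.0)
print(w_em, w_gn)
```

With a step size of one the two updates coincide regardless of the starting point `w_k`, because the Taylor expansion (53) is exact for a quadratic objective.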


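A.9 Proof of Theorem 12

Theorem 12. Proof. In the EM-algorithm the update of the policy parameters takes the
form

$$G_{EM}(w_k) = \operatorname{argmax}_{w \in \mathcal{W}} \mathcal{Q}(w, w_k),$$

where the function $\mathcal{Q}(w, w')$ is given by

$$\mathcal{Q}(w, w') = \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w') \, Q(s, a; w') \Big[ \log \pi(a | s; w) \Big].$$

Note that $\mathcal{Q}$ is a two parameter function, where the first parameter occurs inside the bracket,
while the second parameter occurs outside the bracket. Also note that $\mathcal{Q}(w, w')$ satisfies
the following identities

$$\begin{aligned}
\nabla^{10} \mathcal{Q}(w, w') &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w') \, Q(s, a; w') \Big[ \nabla_w \log \pi(a | s; w) \Big], \\
\nabla^{20} \mathcal{Q}(w, w') &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w') \, Q(s, a; w') \Big[ \nabla_w \nabla_w^\top \log \pi(a | s; w) \Big], \\
\nabla^{11} \mathcal{Q}(w, w') &= \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \nabla_{w'} \Big( p_\gamma(s, a; w') \, Q(s, a; w') \Big) \nabla_w^\top \log \pi(a | s; w).
\end{aligned}$$

Here we have used the notation $\nabla^{ij}$ to denote the $i$-th derivative with respect to the first
parameter and the $j$-th derivative with respect to the second parameter. Note that when we
set $w' = w$ in the first two of these terms we have $\nabla^{10} \mathcal{Q}(w, w) = \nabla_w U(w)$, $\nabla^{20} \mathcal{Q}(w, w) =
\mathcal{H}_2(w)$. A key identity that we need for the proof is that $\nabla^{11} \mathcal{Q}(w, w) = \mathcal{H}_1(w) + \mathcal{H}_{12}(w) +
\mathcal{H}_{12}^\top(w)$. This follows from the observation that $\nabla_w U(w) = \nabla^{10} \mathcal{Q}(w, w)$, so that

$$\mathcal{H}(w) = \nabla_w \nabla_w^\top U(w) = \nabla_w \big( \nabla^{10} \mathcal{Q}(w, w) \big) = \nabla^{20} \mathcal{Q}(w, w) + \nabla^{11} \mathcal{Q}(w, w),$$

so that

$$\mathcal{H}_1(w) + \mathcal{H}_{12}(w) + \mathcal{H}_{12}^\top(w) = \mathcal{H}(w) - \mathcal{H}_2(w) = \nabla^{20} \mathcal{Q}(w, w) + \nabla^{11} \mathcal{Q}(w, w) - \nabla^{20} \mathcal{Q}(w, w),$$

as claimed.

Now, to calculate the matrix $\nabla G_{EM}(w^*)$ we perform a Taylor series expansion of
$\nabla^{10} \mathcal{Q}(w, w')$ in both parameters around the point $(w^*, w^*)$, and evaluated at $(w_{k+1}, w_k)$,
which gives

$$\nabla^{10} \mathcal{Q}(w_{k+1}, w_k) = \nabla^{10} \mathcal{Q}(w^*, w^*) + \nabla^{20} \mathcal{Q}(w^*, w^*)(w_{k+1} - w^*) + \nabla^{11} \mathcal{Q}(w^*, w^*)(w_k - w^*) + \ldots$$

As $w^*$ is a local optimum of $U(w)$ we have that $\nabla^{10} \mathcal{Q}(w^*, w^*) = 0$. Furthermore, as
the sequence $\{w_k\}_{k \in \mathbb{N}}$ was generated by the EM-algorithm, we have, for each $k \in \mathbb{N}$,
that $w_{k+1} = \operatorname{argmax}_{w \in \mathcal{W}} \mathcal{Q}(w, w_k)$, which implies that $\nabla^{10} \mathcal{Q}(w_{k+1}, w_k) = 0$. Finally, as
$\nabla^{20} \mathcal{Q}(w^*, w^*) = \mathcal{H}_2(w^*)$ and $\nabla^{11} \mathcal{Q}(w^*, w^*) = \mathcal{H}_1(w^*) + \mathcal{H}_{12}(w^*) + \mathcal{H}_{12}^\top(w^*)$ we have

$$0 = \mathcal{H}_2(w^*)(w_{k+1} - w^*) + \big( \mathcal{H}_1(w^*) + \mathcal{H}_{12}(w^*) + \mathcal{H}_{12}^\top(w^*) \big)(w_k - w^*) + \ldots$$

Using the fact that $w_{k+1} = G_{EM}(w_k)$ and $w^* = G_{EM}(w^*)$, taking the limit $k \to \infty$ gives

$$0 = \mathcal{H}_2(w^*) \nabla_w G_{EM}(w^*) + \mathcal{H}_1(w^*) + \mathcal{H}_{12}(w^*) + \mathcal{H}_{12}^\top(w^*),$$

so that

$$\nabla_w G_{EM}(w^*) = -\mathcal{H}_2^{-1}(w^*) \big( \mathcal{H}_1(w^*) + \mathcal{H}_{12}(w^*) + \mathcal{H}_{12}^\top(w^*) \big) = I - \mathcal{H}_2^{-1}(w^*) \mathcal{H}(w^*).$$

In the case where the policy parametrization is value consistent with respect to the given
MDP then we have $\mathcal{H}_{12}(w^*) + \mathcal{H}_{12}^\top(w^*) = 0$, so that $\nabla_w G_{EM}(w^*) = I - \mathcal{H}_2^{-1}(w^*) \mathcal{A}_1(w^*)$.
The rest of the proof follows from the result in Theorem 10 when considering $\alpha = 1$. □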


Appendix B. Further Details for Estimation of Preconditioners and the
Gauss-Newton Update Direction

B.1 Recurrent State Search Direction Evaluation for Second Gauss-Newton
Method

In Williams (1992) a sampling algorithm was provided for estimating the gradient of an
infinite horizon MDP with average rewards. This algorithm makes use of a recurrent state,
which we denote by $s^*$. In Algorithm 2 we detail a straightforward extension of this
algorithm to the estimation of the approximate Hessian, $\mathcal{H}_2(w)$, in this MDP framework. The
analogous algorithm for the estimation of the diagonal matrix, $\mathcal{D}_2(w)$, follows similarly. In
Algorithm 2 we make use of an eligibility trace for both the gradient and the approximate
Hessian, which we denote by $\Phi^1$ and $\Phi^2$ respectively. The estimates (up to a positive scalar)
of the gradient and the approximate Hessian are denoted by $\Delta^1$ and $\Delta^2$ respectively.


B.2 Inversion of Preconditioning Matrices 

A computational bottleneck of the Newton method is the inversion of the Hessian matrix,
which scales with $O(n^3)$. In a standard application of the Newton method this inversion is
performed during each iteration, and in large parameter systems this becomes prohibitively
costly. We now consider the inversion of the preconditioning matrix in the proposed
Gauss-Newton methods.

Algorithm 2: Recurrent state sampling algorithm to estimate the search direction
of the second Gauss-Newton method. The algorithm is applicable to Markov decision
processes with an infinite planning horizon and average rewards.

Input: Policy parameter, $w \in \mathcal{W}$,
       Number of restarts, $N \in \mathbb{N}$.

Sample a state from the initial state distribution:
    $s_1 \sim p_1(\cdot)$.

for $i = 1, \ldots, N$ do
    Given the current state, sample an action from the policy:
        $a_t \sim \pi(\cdot | s_t; w)$.
    if $s_t \neq s^*$ then
        update the eligibility traces:
            $\Phi^1 \leftarrow \Phi^1 + \nabla_w \log \pi(a_t | s_t; w)$, $\quad \Phi^2 \leftarrow \Phi^2 + \nabla_w \nabla_w^\top \log \pi(a_t | s_t; w)$
    else
        reset the eligibility traces:
            $\Phi^1 = 0$, $\quad \Phi^2 = 0$.
    end
    Update the estimates of $\nabla_w U(w)$ and $\mathcal{H}_2(w)$:
        $\Delta^1 \leftarrow \Delta^1 + R(a_t, s_t) \Phi^1$, $\quad \Delta^2 \leftarrow \Delta^2 + R(a_t, s_t) \Phi^2$.
    Sample state from the transition dynamics:
        $s_{t+1} \sim p(\cdot | a_t, s_t)$.
    Update time-step, $t \leftarrow t + 1$.
end

return $\Delta^1$ and $\Delta^2$, which, up to a positive multiplicative constant, are estimates of
$\nabla_w U(w)$ and $\mathcal{H}_2(w)$.
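
A minimal sketch of the loop in Algorithm 2 is given below for a toy problem. Everything specific in it is an assumption made for illustration: the 2-state MDP, the positive reward table, the transition table, the uniform initial state, and a state-independent logistic policy with a single scalar parameter `w`. The function name `recurrent_state_estimates` is hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recurrent_state_estimates(w, n_steps, s_star=0, seed=0):
    """Sketch of Algorithm 2 for a toy 2-state MDP with a scalar policy parameter."""
    rng = random.Random(seed)
    # Toy policy (assumption): action 1 is taken with probability sigmoid(w) in every state.
    p1 = sigmoid(w)
    # Toy model (assumption): positive rewards R(a, s) and P(next state = 1 | a, s).
    reward = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}
    p_next_is_1 = {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.4, (1, 1): 0.7}

    trace1 = trace2 = 0.0      # eligibility traces for gradient and approximate Hessian
    delta1 = delta2 = 0.0      # running (unnormalized) estimates
    s = rng.choice([0, 1])     # uniform initial state distribution (assumption)
    for _ in range(n_steps):
        a = 1 if rng.random() < p1 else 0
        if s != s_star:
            # d/dw log pi(a) = a - sigmoid(w); d2/dw2 log pi(a) = -sigmoid(w) * (1 - sigmoid(w))
            trace1 += a - p1
            trace2 += -p1 * (1.0 - p1)
        else:
            # Reset the traces on each visit to the recurrent state.
            trace1 = trace2 = 0.0
        delta1 += reward[(a, s)] * trace1
        delta2 += reward[(a, s)] * trace2
        s = 1 if rng.random() < p_next_is_1[(a, s)] else 0
    return delta1, delta2

grad_estimate, hess_estimate = recurrent_state_estimates(w=0.2, n_steps=1000)
print(grad_estimate, hess_estimate)
```

Because this toy policy is log-concave and the rewards are positive, the approximate Hessian estimate returned by the sketch can never be positive, which mirrors the negative-definiteness property discussed in the main text.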


Firstly, in the diagonal forms of the Gauss-Newton methods the preconditioning matrix
is diagonal, so that the inversion of this matrix is trivial and scales linearly in the number of
parameters. In general the preconditioning matrix of the full Gauss-Newton methods will
have no form of sparsity, and so no computational savings will be possible when inverting
the preconditioning matrix. There is, however, a source of sparsity that allows for the
efficient inversion of $\mathcal{H}_2$ in certain cases of interest. In particular, any product structure
(with respect to the control parameters) in the model of the agent’s behaviour will lead to
sparsity in $\mathcal{H}_2$. For example, in partially observable Markov decision processes in which the
behaviour of the agent is modeled through a finite state controller (Meuleau et al.),
there are three functions that are to be optimized: the initial belief distribution, the belief
transition dynamics and the policy. In this case the dynamics of the system are given by

$$p(s', o', b', a' | s, o, b, a; v, w) = p(s' | s, a) \, p(o' | s') \, p(b' | b, o'; v) \, \pi(a' | b', o'; w),$$

in which $o \in \mathcal{O}$ is an observation from a finite observation space, $\mathcal{O}$, and $b \in \mathcal{B}$ is the
belief state from a finite belief space, $\mathcal{B}$. The initial belief is given by the initial belief
distribution, $p(b | o; u)$. The parameters to be optimized in this system are $u$, $v$ and $w$. It
can be seen that in this system $\mathcal{H}_2(u, v, w)$ is block-diagonal (across the parameters $u$, $v$
and $w$) and the matrix inversion can be performed more efficiently by inverting each of the
block matrices individually. By contrast, the Hessian $\mathcal{H}(u, v, w)$ does not exhibit any such
sparsity properties.
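
As a rough illustration of the saving, suppose that dense inversion of an $m \times m$ block costs on the order of $m^3$ operations, and write $n_u$, $n_v$ and $n_w$ for the dimensions of $u$, $v$ and $w$ (this notation and the cost model are assumptions introduced here, not taken from the paper).

```latex
\begin{aligned}
\text{block-wise inversion:} \quad & n_u^3 + n_v^3 + n_w^3, \\
\text{dense inversion:} \quad & (n_u + n_v + n_w)^3, \\
\text{equal blocks } n_u = n_v = n_w = m: \quad & 3 m^3 \;\text{ versus }\; (3m)^3 = 27 m^3 .
\end{aligned}
```

Under these assumptions, three equally sized blocks reduce the inversion cost by a factor of nine.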


B.3 A Hessian-free Conjugate Gradient Method for Fast Gauss-Newton 
Ascent 

In general, the matrix inversion required in the full Gauss-Newton methods scales cubically
in the number of policy parameters, which will be prohibitively expensive in large parameter
systems. It is possible, however, to approximate the search direction of the second
Gauss-Newton method at a computational cost that is linear in the number of policy parameters.
We focus on this form of the Gauss-Newton method for the remainder of this section.
These computational savings are achieved through an application of the conjugate-gradient
algorithm (Hestenes and Stiefel, 1952), along with a Hessian-free approximation (Nocedal
and Wright, 2006) to a matrix-vector product that occurs within the conjugate-gradient
algorithm.

It can be seen that the search direction of the Gauss-Newton method at $w \in \mathcal{W}$ is given
by the solution to the linear system,

$$-\mathcal{H}_2(w) x = \nabla_w U(w). \tag{55}$$

The conjugate-gradient algorithm (Hestenes and Stiefel, 1952) is an iterative algorithm for
solving linear systems. The algorithm maintains an estimate of the solution of the linear
system during the course of the algorithm. We denote the estimate at the $k$-th iteration
by $x_k$. The first approximation we propose is the use of $x_k$, for some given $k \in \mathbb{N}$, as
an approximation to the search direction of the Gauss-Newton method. As $-\mathcal{H}_2(w)$ is
positive-definite the conjugate-gradient algorithm is guaranteed to find the exact solution
of the linear system (55) within at most $n$ iterations. Furthermore, when $x_0$ is appropriately
selected, then $x_k$ will be an ascent direction for all $k \in \mathbb{N}_n$. This property is guaranteed
when, for instance, $x_0 = \nabla_w U(w)$. Each iteration of the conjugate-gradient algorithm
scales quadratically in $n$. If $x_k$ is used in place of $x$ in the Gauss-Newton method, then the
computational complexity will scale as $O(k n^2)$. When $k \ll n$, therefore, the computational
complexity of such an approach will be far less than the standard application of the
Gauss-Newton method. When $n$ is large, however, this quadratic scaling in $n$ will still be
prohibitive and for this case we shall now consider an additional level of approximation in
order to reduce the computational costs still further.

The computational bottleneck in each iteration of the conjugate-gradient algorithm is
a matrix-vector product that scales quadratically in the size of the linear system. In the
case of the Gauss-Newton method, the matrix-vector product in the $k$-th iteration of the
conjugate-gradient algorithm takes the form,

$$-\mathcal{H}_2(w) p_k = -\sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, Q(s, a; w) \, \nabla_w \nabla_w^\top \log \pi(a | s; w) \, p_k, \quad k \in \mathbb{N}_n, \tag{56}$$

in which $p_k$ is the $k$-th conjugate direction found during the conjugate-gradient algorithm.
The matrix-vector product in (56) can be equivalently viewed as a weighted summation
of matrix-vector products in which, for each state-action pair, $(s, a) \in \mathcal{S} \times \mathcal{A}$, we have the
matrix-vector product, $\nabla_w \nabla_w^\top \log \pi(a | s; w) \, p_k$. This perspective of (56) allows the use of
standard finite-difference approximations to efficiently approximate each of these
matrix-vector products, and thus (56) itself. In particular, introducing the scalar, $\epsilon \in \mathbb{R}^{+}$, with
$\epsilon \approx 0$, we have that

$$\nabla_w \nabla_w^\top \log \pi(a | s; w) \, p_k \approx \frac{1}{\epsilon} \Big( \nabla_w \log \pi(a | s; w + \epsilon p_k) - \nabla_w \log \pi(a | s; w) \Big), \tag{57}$$

for each $(s, a) \in \mathcal{S} \times \mathcal{A}$. Using (57) in (56) gives

$$-\mathcal{H}_2(w) p_k \approx \frac{1}{\epsilon} \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} p_\gamma(s, a; w) \, Q(s, a; w) \Big( \nabla_w \log \pi(a | s; w) - \nabla_w \log \pi(a | s; w + \epsilon p_k) \Big). \tag{58}$$

The use of this approximation removes the necessity to either construct $\mathcal{H}_2(w)$ or to perform
the matrix-vector product, and each iteration of conjugate-gradients now has a computational
complexity that is linear in the dimension of the parameter space. Using $k$ iterations
of conjugate-gradients to approximate the search direction results in a computational cost
that scales as $O(kn)$. We shall refer to the use of these two approximations (i.e., the use
of the conjugate-gradient algorithm to approximately solve the linear system (55) and the
use of the finite-difference approximation (58)) as the conjugate-gradient Gauss-Newton
method. A summary of the algorithm is given in Algorithm 3.
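
The two approximations can be sketched together on a small synthetic problem. The sketch below is not the authors' implementation: the logistic policy, the feature vectors `FEATURES`, the positive weights `WEIGHTS` (standing in for $p_\gamma(s, a; w) Q(s, a; w)$) and all function names are assumptions for illustration, and the conjugate-gradient iteration is started from $x_0 = 0$ rather than from the gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]

# Hypothetical problem data: one feature vector per state and positive weights
# c[(s, a)] standing in for p_gamma(s, a; w) * Q(s, a; w).
FEATURES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
WEIGHTS = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.8, (1, 1): 0.3, (2, 0): 0.6, (2, 1): 1.2}

def grad_log_pi(w, s, a):
    """Gradient of log pi(a|s; w) for a logistic policy pi(1|s; w) = sigmoid(w . phi_s)."""
    phi = FEATURES[s]
    return [(a - sigmoid(dot(w, phi))) * x for x in phi]

def weighted_score(w_eval):
    """sum_{s,a} c(s, a) * grad log pi(a|s; w_eval), with the weights c held fixed."""
    total = [0.0, 0.0]
    for (s, a), c in WEIGHTS.items():
        total = axpy(c, grad_log_pi(w_eval, s, a), total)
    return total

def neg_h2_matvec(w, p, eps=1e-6):
    """Finite-difference approximation (58) to -H_2(w) p."""
    base = weighted_score(w)
    shifted = weighted_score(axpy(eps, p, w))
    return [(b - s) / eps for b, s in zip(base, shifted)]

def conjugate_gradient(w, rhs, iterations):
    """Approximately solve -H_2(w) x = rhs, starting from x_0 = 0."""
    x = [0.0] * len(rhs)
    r = list(rhs)              # residual rhs - A x for x = 0
    p = list(r)
    for _ in range(iterations):
        Ap = neg_h2_matvec(w, p)
        step = dot(r, r) / dot(p, Ap)
        x = axpy(step, p, x)
        r_new = axpy(-step, Ap, r)
        if dot(r_new, r_new) < 1e-20:
            break
        beta = dot(r_new, r_new) / dot(r, r)
        p = axpy(beta, p, r_new)   # new conjugate direction: r_new + beta * p
        r = r_new
    return x

w = [0.2, -0.1]
g = weighted_score(w)          # plays the role of the gradient in (55)
x = conjugate_gradient(w, g, iterations=2)
print(x, dot(x, g))
```

Each matrix-vector product costs two evaluations of the weighted score, so the approximate Hessian is never formed; the inner product of the returned direction with the gradient is positive, as expected of an ascent direction.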

The use of the conjugate-gradient algorithm and the finite-difference approximation 
(58), are based upon methods used in Hessian-free algorithms ( [Nocedal and Wright , 2006) 
from the numerical optimization literature. In the case of Markov decision processes, the 
conjugate-gradient algorithm would be used within an Hessian-free algorithm to solve the 
linear system, 

- VwV^U{w)x = VwU{w) . (60) 

A finite-difference approximation is also applied in Hessian-free methods, in this case taking 
the form. 


-V{w)pk « - 


E 


p^{s, a; w)Q{s, a; w)V.w log7r(a|s; 


w 


(61) 


Y P'r{s,a]w + €pk)Q{s,a;w + epk)Vw^og7T{a\s;w + epk) 
(s,a)eSxA 
Algorithm 3: The Conjugate-Gradient Gauss-Newton Method
Input: Initial vector of policy parameters, $w_0 \in \mathcal{W}$.

Set iteration counter, $k \leftarrow 0$.

repeat

    Calculate the gradient of the objective at the current point in the parameter
    space, $\nabla_{w = w_k} U(w)$.

    Use the conjugate-gradient algorithm to approximately solve the linear system,

    $$-\mathcal{H}_2(w_k) x = \nabla_{w = w_k} U(w), \qquad (59)$$

    using some given stopping criterion in the conjugate-gradient algorithm, and
    using the finite-difference approximation (58) to approximate the matrix-vector
    product (56).

    Update policy parameters, $w_{k+1} = w_k + \alpha x_k$, in which $x_k$ is the approximate
    solution of the linear system (59) and $\alpha \in \mathbb{R}^{+}$ is the step size.

    Update iteration counter, $k \leftarrow k + 1$.

until Convergence of the policy parameters;

return $w_k$
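One iteration of Algorithm 3 can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' code: `conjugate_gradient` and `gauss_newton_update` are hypothetical names, the stopping criterion (residual norm below `tol`, or `max_iters` reached) is one possible choice, and the operator is only accessed through a callable `matvec`. In the demonstration an explicit symmetric positive-definite matrix `M` stands in for $-\mathcal{H}_2(w_k)$; in practice the callable would evaluate the finite-difference approximation (58).

```python
import numpy as np

def conjugate_gradient(matvec, b, max_iters=50, tol=1e-10):
    """Approximately solve M x = b, where M is accessed only via matvec(p) = M p.

    M is assumed symmetric positive-definite (it plays the role of -H_2(w_k)).
    """
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - M x at x = 0
    p = r.copy()                 # first search direction
    rs = r @ r
    for _ in range(max_iters):
        if np.sqrt(rs) < tol:    # stopping criterion on the residual norm
            break
        Mp = matvec(p)
        step = rs / (p @ Mp)
        x += step * p
        r -= step * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def gauss_newton_update(w, grad, matvec, alpha=1.0, cg_iters=50):
    """w_{k+1} = w_k + alpha * x_k, with x_k the approximate solution of (59)."""
    x = conjugate_gradient(matvec, grad, max_iters=cg_iters)
    return w + alpha * x

# Stand-in problem: M replaces -H_2(w_k) and grad replaces the policy gradient.
rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
M = B @ B.T + 6.0 * np.eye(6)    # symmetric positive-definite by construction
grad = rng.normal(size=6)
w0 = np.zeros(6)

w1 = gauss_newton_update(w0, grad, lambda p: M @ p)
```

Because the solver never touches the matrix itself, each conjugate-gradient iteration costs one call to `matvec` plus a few vector operations.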


Given the similarities between the conjugate-gradient Gauss-Newton method and Hessian-
free methods, it is worth noting some important differences between the two algorithms.
Firstly, as the Hessian is not necessarily negative-definite, it is not necessarily the case that
the conjugate-gradient algorithm will be able to solve the linear system (60). It is also no longer
the case that $x_k$, $k \in \mathbb{N}$, will be an ascent direction of the objective function regardless
of the initialization of the conjugate-gradient algorithm (Nocedal and Wright, 2006).
Additionally, comparing the finite-difference approximation (58) with the finite-difference
approximation (61) it can be seen that in the rightmost term of (58) the discounted occupancy
distribution and the state-action value function depend on the current policy parameters,
$w \in \mathcal{W}$, while in the corresponding term in (61) these quantities depend on the perturbed
policy parameters, $w + \epsilon p_k \in \mathcal{W}$. Terms such as the state-action value function cannot
be calculated exactly in most large-scale MDPs of interest and instead must be estimated. 
In a standard application of a Hessian-free method, therefore, it would be necessary to 
re-estimate such quantities in each iteration of the conjugate-gradient algorithm. By contrast,
in the conjugate-gradient Gauss-Newton method the same estimate of the state-action
value function and discounted occupancy marginals can be used in all of the iterations of
the conjugate-gradient algorithm. In policy gradient algorithms estimating such terms typically
forms an expensive part of the overall algorithm, which means that each iteration of
the conjugate-gradient Gauss-Newton method will be more computationally efficient than
an iteration of a Hessian-free method. It also means that while it may appear that the approximation (58)
should have the same cost as two gradient evaluations, it will typically be cheaper than
this in practice. Furthermore, it means that there is an additional level of approximation in
Hessian-free methods that is not present in the conjugate-gradient Gauss-Newton method.
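The difference in evaluation cost between (58) and (61) can be illustrated by counting policy evaluations on a small problem. The sketch below is an assumption-laden toy, not an experiment from this work: `TabularMDP` is a hypothetical random MDP with a tabular softmax policy, its quantities are computed exactly by linear solves rather than estimated, and the occupancy is left unnormalized. Each call to `evaluate` stands in for one (expensive) estimation of $p_\gamma$ and $Q$.

```python
import numpy as np

class TabularMDP:
    """Small random MDP with a tabular softmax policy (illustrative only)."""

    def __init__(self, n_states=3, n_actions=2, gamma=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.S, self.A, self.gamma = n_states, n_actions, gamma
        self.P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
        self.R = rng.uniform(size=(n_states, n_actions))                       # positive rewards
        self.p0 = np.full(n_states, 1.0 / n_states)
        self.evaluations = 0

    def policy(self, w):
        logits = w.reshape(self.S, self.A)
        logits = logits - logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        return pi / pi.sum(axis=1, keepdims=True)

    def evaluate(self, w):
        """Return p_gamma(s, a; w) * Q(s, a; w); counted as one policy evaluation."""
        self.evaluations += 1
        pi = self.policy(w)
        P_pi = np.einsum("sa,sat->st", pi, self.P)
        r_pi = (pi * self.R).sum(axis=1)
        eye = np.eye(self.S)
        V = np.linalg.solve(eye - self.gamma * P_pi, r_pi)
        Q = self.R + self.gamma * self.P @ V
        d = np.linalg.solve(eye - self.gamma * P_pi.T, self.p0)  # unnormalized occupancy
        return d[:, None] * pi * Q

    def grad_log_pi(self, w):
        """G[s, a, :] = grad_w log pi(a|s; w); shape (S, A, S * A)."""
        pi = self.policy(w)
        G = np.zeros((self.S, self.A, self.S, self.A))
        for s in range(self.S):
            G[s, :, s, :] = np.eye(self.A) - pi[s][None, :]
        return G.reshape(self.S, self.A, -1)

def gauss_newton_matvec(mdp, w, weights, p, eps=1e-6):
    """(58): weights = p_gamma * Q are computed once at w and reused."""
    diff = mdp.grad_log_pi(w) - mdp.grad_log_pi(w + eps * p)
    return np.einsum("sa,san->n", weights, diff) / eps

def policy_gradient(mdp, w):
    """Sum over (s, a) of p_gamma * Q * grad log pi; triggers one evaluation."""
    return np.einsum("sa,san->n", mdp.evaluate(w), mdp.grad_log_pi(w))

def hessian_free_matvec(mdp, w, grad_at_w, p, eps=1e-6):
    """(61): p_gamma and Q must be recomputed at the perturbed point w + eps * p."""
    return (grad_at_w - policy_gradient(mdp, w + eps * p)) / eps

mdp = TabularMDP()
rng = np.random.default_rng(1)
w = rng.normal(size=mdp.S * mdp.A)
directions = rng.normal(size=(5, w.size))   # stand-ins for conjugate-gradient directions

weights = mdp.evaluate(w)
gn = [gauss_newton_matvec(mdp, w, weights, p) for p in directions]
gn_evals = mdp.evaluations                  # a single evaluation, shared by all products

mdp.evaluations = 0
grad_at_w = policy_gradient(mdp, w)
hf = [hessian_free_matvec(mdp, w, grad_at_w, p) for p in directions]
hf_evals = mdp.evaluations                  # one at w plus one per product
```

In this toy setting the Gauss-Newton products share a single evaluation, while the Hessian-free products require a fresh evaluation for every direction, which mirrors the re-estimation cost described above.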

Additionally, by considering the Fisher information matrix that takes the form (14), the 
approach presented in this section could also be applied in the natural gradient framework. 